Marketing Campaign Analysis¶

Problem Definition¶

In the context of marketing analytics, customer segmentation has a vital role to play in optimizing ROI.

The Context:¶

  • identifying clusters of customers based on their purchase behavior allows to develop targeted marketing campaigns

The objective:¶

  • Goal is to increase the volume of purchases, respond to all customers needs and increase ROI

The key questions:¶

  • What are the different types of behaviors for purchase as well as demographic profiles of customers included in the proposed dataset?

The problem formulation:¶

  • Objective is by using Unsupervised Learning such as Dimensionality Reduction and Clustering, to come up with the best possible customer segments using the given customer dataset.

Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: The amount spent on fish products in the last 2 years
  11. MntMeatProducts: The amount spent on meat products in the last 2 years
  12. MntFruits: The amount spent on fruits products in the last 2 years
  13. MntSweetProducts: Amount spent on sweet products in the last 2 years
  14. MntWines: The amount spent on wine products in the last 2 years
  15. MntGoldProds: The amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 If the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2016.

Import the necessary libraries and load the data¶

In [247]:
#Importing necessary libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

# Importing PCA and t-SNE
from sklearn.decomposition import PCA

from sklearn.manifold import TSNE

%matplotlib inline

# Importing clustering algorithms

!pip install scikit-learn-extra

from sklearn.cluster import KMeans

from sklearn.mixture import GaussianMixture

from sklearn_extra.cluster import KMedoids

from sklearn.cluster import AgglomerativeClustering

from sklearn.cluster import DBSCAN


# Silhouette score
from sklearn.metrics import silhouette_score

import warnings
warnings.filterwarnings("ignore")

from scipy.stats import chi2_contingency
Requirement already satisfied: scikit-learn-extra in /usr/local/lib/python3.10/dist-packages (0.3.0)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.25.2)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.11.4)
Requirement already satisfied: scikit-learn>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.2.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (3.5.0)
In [248]:
#Mounting Google colab
from google.colab import drive
drive.mount("/content/drive")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
#Loading the data
df = pd.read_csv('/content/drive/MyDrive/MIT/Capstone project/marketing_campaign+%284%29.csv')

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
  • Drop the column which has no null values
In [ ]:
#Reading the dataset by looking at the first 5 rows
df.head()
Out[ ]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 10 4 7 0 0 0 0 0 0 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 1 2 5 0 0 0 0 0 0 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 2 10 4 0 0 0 0 0 0 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 0 4 6 0 0 0 0 0 0 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 3 6 5 0 0 0 0 0 0 0

5 rows × 27 columns

In [ ]:
#Understanding the shape of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
In [ ]:
df.shape
Out[ ]:
(2240, 27)

There are 2240 rows and 27 columns. All data are integers or float except Dt_Customer, Marital Status and Education.

Income has 24 missing values.

There are no other missing values.

Dt_Customer should be a date type.

In [ ]:
#Confirming missing values for Income
df['Income'].isnull().sum()
Out[ ]:
24
In [ ]:
#Looking at how these missing data are reported in the dataset
df[df['Income'].isnull()]
Out[ ]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
10 1994 1983 Graduation Married NaN 1 0 15-11-2013 11 5 ... 0 2 7 0 0 0 0 0 0 0
27 5255 1986 Graduation Single NaN 1 0 20-02-2013 19 5 ... 0 0 1 0 0 0 0 0 0 0
43 7281 1959 PhD Single NaN 0 0 05-11-2013 80 81 ... 3 4 2 0 0 0 0 0 0 0
48 7244 1951 Graduation Single NaN 2 1 01-01-2014 96 48 ... 1 4 6 0 0 0 0 0 0 0
58 8557 1982 Graduation Single NaN 1 0 17-06-2013 57 11 ... 0 3 6 0 0 0 0 0 0 0
71 10629 1973 2n Cycle Married NaN 1 0 14-09-2012 25 25 ... 0 3 8 0 0 0 0 0 0 0
90 8996 1957 PhD Married NaN 2 1 19-11-2012 4 230 ... 2 8 9 0 0 0 0 0 0 0
91 9235 1957 Graduation Single NaN 1 1 27-05-2014 45 7 ... 0 2 7 0 0 0 0 0 0 0
92 5798 1973 Master Together NaN 0 0 23-11-2013 87 445 ... 4 8 1 0 0 0 0 0 0 0
128 8268 1961 PhD Married NaN 0 1 11-07-2013 23 352 ... 1 7 6 0 0 0 0 0 0 0
133 1295 1963 Graduation Married NaN 0 1 11-08-2013 96 231 ... 5 7 4 0 0 0 0 0 0 0
312 2437 1989 Graduation Married NaN 0 0 03-06-2013 69 861 ... 5 12 3 0 1 0 1 0 0 0
319 2863 1970 Graduation Single NaN 1 2 23-08-2013 67 738 ... 3 10 7 0 1 0 1 0 0 0
1379 10475 1970 Master Together NaN 0 1 01-04-2013 39 187 ... 2 6 5 0 0 0 0 0 0 0
1382 2902 1958 Graduation Together NaN 1 1 03-09-2012 87 19 ... 0 3 5 0 0 0 0 0 0 0
1383 4345 1964 2n Cycle Single NaN 1 1 12-01-2014 49 5 ... 0 2 7 0 0 0 0 0 0 0
1386 3769 1972 PhD Together NaN 1 0 02-03-2014 17 25 ... 0 3 7 0 0 0 0 0 0 0
2059 7187 1969 Master Together NaN 1 1 18-05-2013 52 375 ... 10 4 3 0 0 0 0 0 0 0
2061 1612 1981 PhD Single NaN 1 0 31-05-2013 82 23 ... 0 3 6 0 0 0 0 0 0 0
2078 5079 1971 Graduation Married NaN 1 1 03-03-2013 82 71 ... 1 3 8 0 0 0 0 0 0 0
2079 10339 1954 Master Together NaN 0 1 23-06-2013 83 161 ... 1 4 6 0 0 0 0 0 0 0
2081 3117 1955 Graduation Single NaN 0 1 18-10-2013 95 264 ... 1 5 7 0 0 0 0 0 0 0
2084 5250 1943 Master Widow NaN 0 0 30-10-2013 75 532 ... 5 11 1 0 0 1 0 0 0 1
2228 8720 1978 2n Cycle Together NaN 0 0 12-08-2012 53 32 ... 0 1 0 0 1 0 0 0 0 0

24 rows × 27 columns

These 24 rows represent 1% of all data. So I do not expect to introduce any bias by removing these rows.

In [ ]:
#before working on the data and modifying it, I will create the new variable df_copy
df_copy = df.copy(deep=True)
In [ ]:
#Remove these rows

df_copy['Income']= df_copy['Income'].astype(float)
df_copy.dropna(inplace = True)
df_copy.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2216 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2216 non-null   int64  
 1   Year_Birth           2216 non-null   int64  
 2   Education            2216 non-null   object 
 3   Marital_Status       2216 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2216 non-null   int64  
 6   Teenhome             2216 non-null   int64  
 7   Dt_Customer          2216 non-null   object 
 8   Recency              2216 non-null   int64  
 9   MntWines             2216 non-null   int64  
 10  MntFruits            2216 non-null   int64  
 11  MntMeatProducts      2216 non-null   int64  
 12  MntFishProducts      2216 non-null   int64  
 13  MntSweetProducts     2216 non-null   int64  
 14  MntGoldProds         2216 non-null   int64  
 15  NumDealsPurchases    2216 non-null   int64  
 16  NumWebPurchases      2216 non-null   int64  
 17  NumCatalogPurchases  2216 non-null   int64  
 18  NumStorePurchases    2216 non-null   int64  
 19  NumWebVisitsMonth    2216 non-null   int64  
 20  AcceptedCmp3         2216 non-null   int64  
 21  AcceptedCmp4         2216 non-null   int64  
 22  AcceptedCmp5         2216 non-null   int64  
 23  AcceptedCmp1         2216 non-null   int64  
 24  AcceptedCmp2         2216 non-null   int64  
 25  Complain             2216 non-null   int64  
 26  Response             2216 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 484.8+ KB
In [ ]:
# Change Dt_Customer from object to date time type
df_copy['Dt_Customer'] = pd.to_datetime(df_copy['Dt_Customer'], dayfirst= True)
df_copy['Dt_Customer'].head()
Out[ ]:
0   2012-09-04
1   2014-03-08
2   2013-08-21
3   2014-02-10
4   2014-01-19
Name: Dt_Customer, dtype: datetime64[ns]

Dt_Customer is now a datetime format

In [ ]:
#Removing columns that are not useful for the analysis
df_copy.drop(columns = 'ID', axis = 1, inplace = True)
df_copy.head()
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635 88 ... 10 4 7 0 0 0 0 0 0 1
1 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11 1 ... 1 2 5 0 0 0 0 0 0 0
2 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426 49 ... 2 10 4 0 0 0 0 0 0 0
3 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11 4 ... 0 4 6 0 0 0 0 0 0 0
4 1981 PhD Married 58293.0 1 0 2014-01-19 94 173 43 ... 3 6 5 0 0 0 0 0 0 0

5 rows × 26 columns

In [ ]:
#Looking for duplicates
df_copy[df_copy.duplicated()]
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
83 1963 Master Together 38620.0 0 0 2013-05-11 56 112 17 ... 5 3 3 0 0 0 0 0 0 0
179 1951 2n Cycle Married 78497.0 0 0 2013-12-01 44 207 26 ... 7 12 2 0 0 0 1 0 0 0
281 1976 Graduation Together 51369.0 0 1 2012-10-25 84 297 7 ... 2 4 8 0 0 0 0 0 0 0
282 1946 Graduation Together 37760.0 0 0 2012-08-31 20 84 5 ... 1 6 7 0 0 0 0 0 0 0
363 1978 PhD Married 37717.0 1 0 2012-11-23 31 9 0 ... 0 2 9 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2218 1978 Graduation Together 22775.0 1 0 2013-06-19 40 5 1 ... 0 2 8 0 0 0 0 0 0 0
2221 1982 Master Single 75777.0 0 0 2013-07-04 12 712 26 ... 6 11 1 0 1 1 0 0 0 1
2225 1968 Graduation Together 58554.0 1 1 2012-09-26 55 368 24 ... 2 6 7 0 0 0 0 0 0 0
2234 1974 Graduation Married 34421.0 1 0 2013-07-01 81 3 3 ... 0 2 7 0 0 0 0 0 0 0
2236 1946 PhD Together 64014.0 2 1 2014-06-10 56 406 0 ... 2 5 7 0 0 0 1 0 0 0

182 rows × 26 columns

There are 182 duplicated rows. We have to remove duplicates

In [ ]:
# Removing duplicated rows
df_copy.drop_duplicates( keep='first', inplace=True, ignore_index=False)
df_copy.head()
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635 88 ... 10 4 7 0 0 0 0 0 0 1
1 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11 1 ... 1 2 5 0 0 0 0 0 0 0
2 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426 49 ... 2 10 4 0 0 0 0 0 0 0
3 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11 4 ... 0 4 6 0 0 0 0 0 0 0
4 1981 PhD Married 58293.0 1 0 2014-01-19 94 173 43 ... 3 6 5 0 0 0 0 0 0 0

5 rows × 26 columns

In [ ]:
#look at the final shape and data type before the analysis
df_copy.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2034 entries, 0 to 2239
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Year_Birth           2034 non-null   int64         
 1   Education            2034 non-null   object        
 2   Marital_Status       2034 non-null   object        
 3   Income               2034 non-null   float64       
 4   Kidhome              2034 non-null   int64         
 5   Teenhome             2034 non-null   int64         
 6   Dt_Customer          2034 non-null   datetime64[ns]
 7   Recency              2034 non-null   int64         
 8   MntWines             2034 non-null   int64         
 9   MntFruits            2034 non-null   int64         
 10  MntMeatProducts      2034 non-null   int64         
 11  MntFishProducts      2034 non-null   int64         
 12  MntSweetProducts     2034 non-null   int64         
 13  MntGoldProds         2034 non-null   int64         
 14  NumDealsPurchases    2034 non-null   int64         
 15  NumWebPurchases      2034 non-null   int64         
 16  NumCatalogPurchases  2034 non-null   int64         
 17  NumStorePurchases    2034 non-null   int64         
 18  NumWebVisitsMonth    2034 non-null   int64         
 19  AcceptedCmp3         2034 non-null   int64         
 20  AcceptedCmp4         2034 non-null   int64         
 21  AcceptedCmp5         2034 non-null   int64         
 22  AcceptedCmp1         2034 non-null   int64         
 23  AcceptedCmp2         2034 non-null   int64         
 24  Complain             2034 non-null   int64         
 25  Response             2034 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(22), object(2)
memory usage: 429.0+ KB

Observations and Insights from the Data overview:¶

There are 2034 rows, with 26 columns. Datatypes are integers, floats (income and year of birth).

Object data are Marital Status and Education. There are no more missing values or duplicated rows.

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the summary statistics of the data? Explore summary statistics for numerical variables and the categorical variables
  2. Find out number of unique observations in each category of categorical columns? Write your findings/observations/insights
  3. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?
  4. There are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?
In [ ]:
#Summary statistics of the numerical data
df_copy.describe().T
Out[ ]:
count mean min 25% 50% 75% max std
Year_Birth 2034.0 1968.802852 1893.0 1959.0 1970.0 1977.0 1996.0 11.97542
Income 2034.0 52357.791544 1730.0 35528.25 51533.0 68480.75 666666.0 25526.956988
Kidhome 2034.0 0.44297 0.0 0.0 0.0 1.0 2.0 0.535914
Teenhome 2034.0 0.508358 0.0 0.0 0.0 1.0 2.0 0.54613
Dt_Customer 2034 2013-07-11 14:44:57.345132544 2012-07-30 00:00:00 2013-01-17 00:00:00 2013-07-12 12:00:00 2014-01-02 00:00:00 2014-06-29 00:00:00 NaN
Recency 2034.0 48.844641 0.0 24.0 49.0 74.0 99.0 28.983678
MntWines 2034.0 305.180924 0.0 23.0 175.5 505.0 1493.0 337.323274
MntFruits 2034.0 26.292527 0.0 2.0 8.0 33.0 199.0 39.772885
MntMeatProducts 2034.0 167.76352 0.0 16.0 68.0 230.0 1725.0 226.790697
MntFishProducts 2034.0 37.449361 0.0 3.0 12.0 50.0 259.0 54.792015
MntSweetProducts 2034.0 27.141101 0.0 1.0 8.0 33.75 262.0 41.49443
MntGoldProds 2034.0 43.809735 0.0 9.0 24.0 56.0 321.0 51.741969
NumDealsPurchases 2034.0 2.337758 0.0 1.0 2.0 3.0 15.0 1.933849
NumWebPurchases 2034.0 4.099312 0.0 2.0 4.0 6.0 27.0 2.756911
NumCatalogPurchases 2034.0 2.6706 0.0 0.0 2.0 4.0 28.0 2.937896
NumStorePurchases 2034.0 5.782203 0.0 3.0 5.0 8.0 13.0 3.238853
NumWebVisitsMonth 2034.0 5.322517 0.0 3.0 6.0 7.0 20.0 2.438665
AcceptedCmp3 2034.0 0.074238 0.0 0.0 0.0 0.0 1.0 0.262222
AcceptedCmp4 2034.0 0.076205 0.0 0.0 0.0 0.0 1.0 0.26539
AcceptedCmp5 2034.0 0.072271 0.0 0.0 0.0 0.0 1.0 0.259
AcceptedCmp1 2034.0 0.065388 0.0 0.0 0.0 0.0 1.0 0.247271
AcceptedCmp2 2034.0 0.012291 0.0 0.0 0.0 0.0 1.0 0.110209
Complain 2034.0 0.009833 0.0 0.0 0.0 0.0 1.0 0.098696
Response 2034.0 0.153392 0.0 0.0 0.0 0.0 1.0 0.360454

Year of birth: There are some suspicious data such as the min year 1893. Such an outlier will have to be looked at on boxplots. The birth years range from 1893 to 1996, with a mean birth year of approximately 1968.

Income: The income ranges from 1730 to 666666, mean value is 52357.8, which is not far from the median. However the max income is 666666.0 with a SD 25526.96, which is quite high. it seems there are high incomes.

Kidhome and Teenhome: The number of kids at home ranges from 0 to 2, with a mean of 0.44. Teenhome: The number of teenagers at home ranges from 0 to 2, with a mean of 0.51. People have maximum 2 kids/teens at home

DL customer: Most people have joined in 07/2013. Most recent is 06/2014, oldest is 06/2012.

Recency: Recency: The recency of the last purchase ranges from 0 to 99 days,Last purchase was 49 days before (mean and median)

For all products: There are big variations in the amount of products bought with high SD, which are higher than the median:

MntWines: The amount spent on wine ranges from 0 to 1493, with a mean of 303.94

MntFruits: The amount spent on fruits ranges from 0 to 199, with a mean of 26.30.

MntMeatProducts: The amount spent on meat products ranges from 0 to 1725, with a mean of 166.95.

MntFishProducts: The amount spent on fish products ranges from 0 to 259, with a mean of 37.53.

MntSweetProducts: The amount spent on sweet products ranges from 0 to 263, with a mean of 27.06.

MntGoldProds: The amount spent on gold products ranges from 0 to 362, with a mean of 44.02.

Purchases behaviors::

NumDealsPurchases: The number of deals purchases ranges from 0 to 15, with a mean of 2.33.

NumWebPurchases: The number of web purchases ranges from 0 to 27, with a mean of 4.08.

NumCatalogPurchases: The number of catalog purchases ranges from 0 to 28, with a mean of 2.66.

NumStorePurchases: The number of store purchases ranges from 0 to 13, with a mean of 5.79.

Visite per month: The number of web visits per month ranges from 0 to 20, with a mean of 5.32.

Complain: The number of complaints ranges from 0 to 1, with a mean of 0.01.Most people did not put a complain

Accepted offers: Behave as categorical variables

AcceptedCmp3: The number of accepted campaign 3 offers ranges from 0 to 1, with a mean of 0.07.

AcceptedCmp4: The number of accepted campaign 4 offers ranges from 0 to 1, with a mean of 0.07.

AcceptedCmp5: The number of accepted campaign 5 offers ranges from 0 to 1, with a mean of 0.07.

AcceptedCmp1: The number of accepted campaign 1 offers ranges from 0 to 1, with a mean of 0.06.

AcceptedCmp2: The number of accepted campaign 2 offers ranges from 0 to 1, with a mean of 0.01.

Response: The number of responses ranges from 0 to 1, with a mean of 0.15.

In [ ]:
offers= ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']
for col in offers:
  print('Number of {} '.format (col), 'is', df_copy[col].sum())
Number of AcceptedCmp1  is 133
Number of AcceptedCmp2  is 25
Number of AcceptedCmp3  is 151
Number of AcceptedCmp4  is 155
Number of AcceptedCmp5  is 147
Number of Response  is 312
In [ ]:
# Statistical summary of categorical data
df_copy.describe(include=[object]).T
Out[ ]:
count unique top freq
Education 2034 5 Graduation 1019
Marital_Status 2034 8 Married 788

THere are 5 levels of education, the most frequent one is Graduation.

There are 8 items for Marital Status which might be high. Most frequent one is Married

In [ ]:
#Unique values in categorical data: education
df_copy.Education.unique()
Out[ ]:
array(['Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'], dtype=object)
In [ ]:
# COunting how many people are in each category of Education
df_copy.Education.value_counts()
Out[ ]:
Education
Graduation    1019
PhD            445
Master         336
2n Cycle       185
Basic           49
Name: count, dtype: int64

As 2n Cycle also refers to Master degree in Europe. So we will replace '2n Cycle' by 'Master'

In [ ]:
#Replace 2n Cycle by Master
df_copy['Education'].replace (to_replace = '2n Cycle', value = 'Master', inplace= True)
df_copy.Education.value_counts()
Out[ ]:
Education
Graduation    1019
Master         521
PhD            445
Basic           49
Name: count, dtype: int64

2n cycle has now disappear and Master has now 365 + 200 values

In [ ]:
#Unique values in categorical data: Marital Status
df_copy['Marital_Status'].unique()
Out[ ]:
array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Alone',
       'Absurd', 'YOLO'], dtype=object)

Absurd and YOLO do not answer the question and should be removed

In [ ]:
#Count how many people are in each category
df_copy['Marital_Status'].value_counts()
Out[ ]:
Marital_Status
Married     788
Together    514
Single      439
Divorced    216
Widow        70
Alone         3
Absurd        2
YOLO          2
Name: count, dtype: int64
In [ ]:
#removing Absurd and Yolo from Marital_Status
remove_list = ['Absurd', 'YOLO']
df_copy = df_copy[~df_copy['Marital_Status'].isin(remove_list)]
df_copy['Marital_Status'].value_counts()
Out[ ]:
Marital_Status
Married     788
Together    514
Single      439
Divorced    216
Widow        70
Alone         3
Name: count, dtype: int64

Absurd and Yolo are not proper answers, so they will be removed from the dataset. Single and alone can be both merged under Single.

Absurd and YOLO have been removed from the list

In [ ]:
#Replace 'Alone' by 'Single'
df_copy['Marital_Status'].replace( to_replace = 'Alone', value= 'Single', inplace= True)
df_copy['Marital_Status'].value_counts()
Out[ ]:
Marital_Status
Married     788
Together    514
Single      442
Divorced    216
Widow        70
Name: count, dtype: int64
In [ ]:
#Unique values in categorical data: Kidhome
df_copy['Kidhome'].value_counts()
Out[ ]:
Kidhome
0    1170
1     819
2      41
Name: count, dtype: int64

Most people do not have kid at home.This is consistent with the age of the population. 819 have 1 child at home and 41 have 2.

In [ ]:
#Unique values in categorical data: Teenhome
df_copy['Teenhome'].value_counts()
Out[ ]:
Teenhome
0    1047
1     934
2      49
Name: count, dtype: int64

Most people do not have Teen at home. A bit more people have a Teen compared to kid at home, also consistent with the age of the population

In [ ]:
#Unique values in categorical data: Complain
df_copy['Complain'].value_counts()
Out[ ]:
Complain
0    2010
1      20
Name: count, dtype: int64

Only 20 people have put a complain

Univariate Analysis on Numerical and Categorical data¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

  • Plot histogram and box plot for different numerical features and understand how the data looks like.
  • Explore the categorical variables like Education, Kidhome, Teenhome, Complain.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. How does the distribution of Income variable vary across the dataset?
  2. The histogram and the box plot are showing some extreme value on the right side of the distribution of the 'Income' feature. Can we consider them as outliers and remove or should we analyze these extreme values?
  3. There are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not to treat) them? At what percentile the upper whisker lies?

Explore and visualise numerical variables

In [ ]:
#Liste of columns with numerical values
df_copy.columns
Out[ ]:
Index(['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Response'],
      dtype='object')
In [ ]:
# work on a new dataframe df_n for numerical values for univariate analysis
df_n = df_copy.copy(deep = True)
In [ ]:
#Remove categorical variables
df_n.drop (columns = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome','Complain','Response', 'AcceptedCmp1', 'AcceptedCmp2','AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Dt_Customer'], axis = 1, inplace = True)
df_n.head()
Out[ ]:
Year_Birth Income Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
0 1957 58138.0 58 635 88 546 172 88 88 3 8 10 4 7
1 1954 46344.0 38 11 1 6 2 1 6 2 1 1 2 5
2 1965 71613.0 26 426 49 127 111 21 42 1 8 2 10 4
3 1984 26646.0 26 11 4 20 10 3 5 2 2 0 4 6
4 1981 58293.0 94 173 43 118 46 27 15 5 5 3 6 5
In [ ]:
#univariate data analysis for numerical values

#creating histograms and box plots for each variable

j=0

for i in df_n:

  fig, ax = plt.subplots(1,2, figsize = (15,4))

  plt.subplot(1,2,1)
  print(i)
  print('Skew :', round(df_n[i].skew(), 2))
  sns.set_style("whitegrid")
  sns.histplot( x = df_n[i], data = df_n, kde= True)
  plt.axvline (x=df_n[i].mean(), color='g', linestyle ='--')
  plt.title(df_n.columns[j])

  plt.subplot(1,2,2)
  sns.boxplot( x = df_n[i], data = df_n)
  plt.title(df_n.columns[j])

  j+=1

  plt.show()
Year_Birth
Skew : -0.38
Income
Skew : 7.06
Recency
Skew : 0.01
MntWines
Skew : 1.17
MntFruits
Skew : 2.11
MntMeatProducts
Skew : 2.05
MntFishProducts
Skew : 1.93
MntSweetProducts
Skew : 2.12
MntGoldProds
Skew : 1.85
NumDealsPurchases
Skew : 2.44
NumWebPurchases
Skew : 1.23
NumCatalogPurchases
Skew : 1.96
NumStorePurchases
Skew : 0.7
NumWebVisitsMonth
Skew : 0.27

Year of birth: There are some outliers, born before 1900, which is suspicious. Bimodal distribution: one mode around 1955 and the other around 1975.

Income has lost of outliers. Normal distribution.

Recency:Uniform distribution. Average of 150 people have visited with a mean of 49 days recency

All products:Have a right skewed distribution with lots of outliers, in the high numbers.

All types of purchases have a right skewed distribution with outliers, except store purchases.

NUmber of store visits: right skewed distribution with outliers.

In [ ]:
#Remove ouliers in income
Q1 = df_n['Income'].quantile(0.25)
Q3 = df_n['Income'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = df_n['Income'].quantile(0.75) + 1.5 *IQR

df_n['Income'] = np.where((df_n['Income'] > upper_bound), upper_bound, df_n['Income'])
sns.boxplot(df_n['Income'])
plt.show()
In [ ]:
#Remove outliers in 'MntWines', 'MntFruits','MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds'

liste = ['MntWines', 'MntFruits','MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumWebVisitsMonth']

for item in liste:
  Q1 = df_n[item].quantile(0.25)
  Q3 = df_n[item].quantile(0.75)
  IQR = Q3 - Q1
  upper_bound = df_n[item].quantile(0.75) + 1.5 *IQR

  df_n[item] = np.where((df_n[item] > upper_bound), upper_bound, df_n[item])
  figure=plt.figure(figsize = (5,4))
  sns.boxplot(df_n[item])
  plt.show()
In [ ]:
#Remove rows where year of birth is <1900
condition =df_n['Year_Birth'] <=1900
df_n = df_n.drop(df_n[condition].index)
In [ ]:
# Apply all these rules to df_copy

Q1 = df_n['Income'].quantile(0.25)
Q3 = df_n['Income'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = df_n['Income'].quantile(0.75) + 1.5 *IQR

df_copy['Income'] = np.where((df_copy['Income'] > upper_bound), upper_bound, df_copy['Income'])
In [ ]:
liste = ['MntWines', 'MntFruits','MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumWebVisitsMonth']

for item in liste:
  Q1 = df_copy[item].quantile(0.25)
  Q3 = df_copy[item].quantile(0.75)
  IQR = Q3 - Q1
  upper_bound = df_copy[item].quantile(0.75) + 1.5 *IQR

  df_copy[item] = np.where((df_copy[item] > upper_bound), upper_bound, df_copy[item])
In [ ]:
#Remove rows where year of birth is <1900
condition =df_copy['Year_Birth'] <=1900
df_copy = df_copy.drop(df_copy[condition].index)

Explore and visualize categorical variables

In [ ]:
#Separate univariate analysis for cmp1, cmp2, cmp3, cmp4, cmp5 and Response.
df_cmp = df_copy[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']]
df_cmp.head()
Out[ ]:
AcceptedCmp1 AcceptedCmp2 AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 Response
0 0 0 0 0 0 1
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
In [ ]:
df_cmp.sum()
Out[ ]:
AcceptedCmp1    132
AcceptedCmp2     25
AcceptedCmp3    151
AcceptedCmp4    155
AcceptedCmp5    145
Response        310
dtype: int64
In [ ]:
# Create a dataframe for visualization
df_cmp_tot = pd.DataFrame({'cmp':['AcceptedCmp1', 'AcceptedCmp2','AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response'], 'values': [132, 25, 151, 155, 145, 310]})
df_cmp_tot
Out[ ]:
cmp values
0 AcceptedCmp1 132
1 AcceptedCmp2 25
2 AcceptedCmp3 151
3 AcceptedCmp4 155
4 AcceptedCmp5 145
5 Response 310
In [ ]:
#create graph
print('AcceptedCmp 1, 2, 3, 4, 5 and last')
sns.barplot(x = 'cmp', y = 'values', data = df_cmp_tot, palette = 'Set1', hue = 'cmp')
plt.xticks(rotation = 60)
plt.show()
AcceptedCmp 1, 2, 3, 4, 5 and last

Most people have accepted the last offers. Very few accepted the second one

In [ ]:
#Explore complains
df_copy['Complain'].value_counts()
Out[ ]:
Complain
0    2008
1      19
Name: count, dtype: int64

There have been 19 complains, 1%. So they will probably be dropped for the analysis

In [ ]:
# Explore Education
df_copy['Education'].value_counts()
Out[ ]:
Education
Graduation    1018
Master         518
PhD            442
Basic           49
Name: count, dtype: int64
In [ ]:
# Explore Education: graph
order = ['Basic', 'Graduation', 'Master', 'PhD']
sns.countplot(df_copy['Education'], palette = 'Set1', order = order)
Out[ ]:
<Axes: xlabel='count', ylabel='Education'>
In [ ]:
#Explore Marital status
sns.countplot(df_copy['Marital_Status'],palette = 'Set1')
Out[ ]:
<Axes: xlabel='count', ylabel='Marital_Status'>
In [ ]:
#Explore kidhome
df_copy['Kidhome'].value_counts()
Out[ ]:
Kidhome
0    1168
1     818
2      41
Name: count, dtype: int64
In [ ]:
sns.violinplot(df_copy['Kidhome'])
Out[ ]:
<Axes: ylabel='Kidhome'>

This graph shows well the spread of people having 0, 1 or 2 kids at home

In [ ]:
#Explore Teenhome
df_copy['Teenhome'].value_counts()
Out[ ]:
Teenhome
0    1045
1     933
2      49
Name: count, dtype: int64
In [ ]:
sns.violinplot(df_copy['Teenhome'])
Out[ ]:
<Axes: ylabel='Teenhome'>

This graph shows well the spread of people having 0, 1 or 2 Teens at home

In [ ]:
# Explore Dt_Customer
df_copy['Dt_Customer'] = pd.to_datetime(df_copy['Dt_Customer'], format='%d-%m-%Y')

# Extract the year and month for plotting
df_copy['Year_Month'] = df_copy['Dt_Customer'].dt.to_period('M')

# Plot the count of customers by 'Dt_Customer'
plt.figure(figsize=(12, 4))
sns.countplot(x='Year_Month', data=df_copy, palette = 'Set1')
plt.xticks(rotation=90)
plt.title('Count of Customers by Dt_Customer')
plt.xlabel('Year-Month')
plt.ylabel('Count')
plt.show()

Bivariate Analysis¶

  • Analyze different categorical and numerical variables and check how different variables are related to each other.
  • Check the relationship of numerical variables with categorical variables.

Bivariate analysis between numerical variables

In [ ]:
sns.pairplot(df_n, corner= True)
Output hidden; open in https://colab.research.google.com to view.
In [ ]:
#Multivariate analysis
# correlation between all variables: bivariate analysis for numerical values

fig = plt.figure(figsize=(17,5))

sns.heatmap(df_copy[['Year_Birth', 'Income', 'Kidhome',
       'Teenhome', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts',
       'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
       'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
       'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3',
       'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2',
       'Complain', 'Response']].corr(), annot = True, cmap='coolwarm', vmin=-1, vmax=1, annot_kws={'size': 8})
plt.xticks(rotation = 60)
plt.title('Heatmap for numerical values')
plt.show()
In [ ]:
#Multivariate analysis
# correlation between all variables: bivariate analysis for numerical values, removing categorical variables, to focus on significant data
fig = plt.figure(figsize=(17,5))

sns.heatmap(df_n.corr(), annot = True, cmap='coolwarm', vmin=-1, vmax=1)
plt.xticks(rotation = 60)
plt.show()

Significant correlations:

Income is positively correlated with all products purchases, Web, Catalog and store purchases. It is negatively correlated with number of Web visits per month and deals purchases. Higher income seem to buy more with less visits online. When they visit, they probably buy. It is negatively correlated with year of birth: younger people earn less

Recencyis not correlated with any other variable.

All products purchases are positively correlated with income, other products purchases, Web, Catalog and store purchases, negatively correlated with number of Web visits per months, and year of birth. Meat and Wines pruchases are particularly lighly correlated

Deal purchases are positively correlated with number of web visits per month, Web purchases. It is negatively correlated with Income. It is not correlated with any specific product.

Web purchases are positively correlated with store and catalog purchases. It is also positively correlated with deals purchases and all types pf products purchases.It is positively correlated with income, and negatively correlated with year of birth (older people).

Catalog purchases: are positively correlated with store and Web purchases. It is also positively correlated with deals purchases and all types of products purchases.It is positively correlated with income, and negatively correlated with year of birth (older people)and number of web visits.

Store purchases: same profile as catalog purchases

Number of web visits per month are negatively correlated with store and catalog purchases, and posiitvely correlated with deals purchases. Negatively correlated with income, all type of products purchases.Positively correlated with year of birth. So younger people visit the we more often.

Bivariate analysis between categorical and numerical variables

In [ ]:
#correlation between categorical variables and numerical variables
#liste for categorical values
list_cat = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome','Complain']

#Dataframe for categorical values
df_cat = df_copy[list_cat]
df_cat.head()
Out[ ]:
Education Marital_Status Kidhome Teenhome Complain
0 Graduation Single 0 0 0
1 Graduation Single 1 1 0
2 Graduation Together 0 0 0
3 Graduation Together 1 0 0
4 PhD Married 1 0 0
In [ ]:
columns_num=['Year_Birth','Income', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts',
       'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
       'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
       'NumStorePurchases', 'NumWebVisitsMonth']

columns_cat = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome','Complain','Response', 'AcceptedCmp3',
       'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2']

#figure  = plt.figure(figsize = )


for num in columns_num:

  fig, ax = plt.subplots(3,5, figsize= (15,10))
  fig.suptitle('Correlation between {} and categorical variables'.format(num))
  plt.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1, wspace=0.4, hspace=0.5)


  for i, cat in enumerate (columns_cat):

    plt.subplot(3,5,i+1)
    sns.boxplot(x=df_copy[cat], y= df_copy[num], data = df_copy, palette = 'Set1', hue =df_copy[cat], legend = False )
    plt.xticks(rotation=60, fontsize = 8)


plt.show()

higher level of education is correlated with older people, higher income.No other correlation with all other numerical variables

Marital Status is not associated with any other numerical variables

No kid at home is associated with higher income, more purchases of all products, more catalog, store and web purchases. Less web visits.

Teenhome:younger people have no Teen at home. Older people have 1 or 2 teens at home. Maybe this could be grand children. More teens at home is correlated with more deals purchases.

No correlation for complains

Accepted 1rst offer: associated with higher income, all types of products purchases, more web, catalog and store purchases, less deals purchases and less web visits.

Similar profile for accepted 5th offer. Other offers are less significantly correlated with numerical variables

In [ ]:
#Correlation between catergorical variables
#The most interesting one is between cmpvalues and the other categorical variables

col1 = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome']
cmps= ['Response', 'AcceptedCmp3','AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2']

for i,col in enumerate(col1):
  for cmp in cmps:
   #Create a crosstab to count occurrences
    crosstab = pd.crosstab(df_copy[cmp], df_copy[col])

    # Perform Chi-Square Test of Independence
    chi2, p, dof, expected = chi2_contingency(crosstab)

    # Print the p-value
    print('The p-value is', p)

  # Plot the heatmap
    plt.figure(figsize=(8, 4))
    sns.heatmap(crosstab, annot=True, cmap='coolwarm', fmt='d')
    plt.title('Heatmap of offers vs {}'.format(col))
    plt.show()
The p-value is 0.0006882643736489826
The p-value is 0.7929217934720838
The p-value is 0.07044817148158672
The p-value is 0.22533866855409893
The p-value is 0.18419305464215485
The p-value is 0.28313668071216286
The p-value is 1.7710435205233267e-09
The p-value is 0.5981493284073702
The p-value is 0.2602413936046153
The p-value is 0.7251850536413036
The p-value is 0.5740442729004149
The p-value is 0.27748848752404426
The p-value is 0.0012181729836805268
The p-value is 0.26064149541943094
The p-value is 3.2836532900200857e-13
The p-value is 4.845384418252758e-21
The p-value is 2.5390823754361973e-17
The p-value is 0.0004827227439292986
The p-value is 9.966066901065126e-13
The p-value is 0.07164104547941785
The p-value is 0.1288177490081988
The p-value is 1.1512583969875174e-18
The p-value is 7.033087178903188e-11
The p-value is 0.34905093015180316

There is correlation between response to last offer and education, marital status and teenhome

There is correlation between response to the 5th offerand Kidhome and Teenhome

There is a correlation between response to the 2nd, and 4th offer and Kidhome

In [ ]:
#Also look at complain and other categorical values
col1 = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome']

for i,col in enumerate(col1):
   #Create a crosstab to count occurrences
    crosstab = pd.crosstab(df_cat['Complain'],df_cat[col])
    # Perform Chi-Square Test of Independence
    chi2, p, dof, expected = chi2_contingency(crosstab)

    # Print the p-value
    print('The p-value is', p)

  # Plot the heatmap
    plt.figure(figsize=(8, 4))
    sns.heatmap(crosstab, annot=True, cmap='coolwarm', fmt='d')
    plt.title('Heatmap of complain vs {}'.format(col))
    plt.show()
The p-value is 0.24675241871574988
The p-value is 0.7252099715190983
The p-value is 0.2845912205020152
The p-value is 0.7008190509168353

No correlation for complain with other categorical variables

Feature Engineering and Data Processing¶

In this section, we will first prepare our dataset for analysis.

  • Imputing missing values

Think About It:

  • Can we extract the age of each customer and create a new feature?
  • Can we find the total kids and teens in the home?
  • Can we find out how many members each family has?
  • Can we find the total amount spent by the customers on various products?
  • Can we find out how long the customer has been with the company?
  • Can we find out how many offers the customers have accepted?
  • Can we find out amount spent per purchase?
In [ ]:
# Create new feature with age of each customer

now = pd.Timestamp('now')
df_copy['Customer_age'] = (2024 - df_copy['Year_Birth'])
df_copy['Customer_age'] = np.round(df_copy['Customer_age'],2)
df_copy.head()
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Year_Month Customer_age
0 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635.0 79.5 ... 7.0 0 0 0 0 0 0 1 2012-09 67
1 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11.0 1.0 ... 5.0 0 0 0 0 0 0 0 2014-03 70
2 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426.0 49.0 ... 4.0 0 0 0 0 0 0 0 2013-08 59
3 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11.0 4.0 ... 6.0 0 0 0 0 0 0 0 2014-02 40
4 1981 PhD Married 58293.0 1 0 2014-01-19 94 173.0 43.0 ... 5.0 0 0 0 0 0 0 0 2014-01 43

5 rows × 28 columns

In [ ]:
#total kids and teens in the home
df_copy['Total_kids'] = df_copy['Kidhome'] + df_copy['Teenhome']
df_copy.head()
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Year_Month Customer_age Total_kids
0 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635.0 79.5 ... 0 0 0 0 0 0 1 2012-09 67 0
1 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11.0 1.0 ... 0 0 0 0 0 0 0 2014-03 70 2
2 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426.0 49.0 ... 0 0 0 0 0 0 0 2013-08 59 0
3 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11.0 4.0 ... 0 0 0 0 0 0 0 2014-02 40 1
4 1981 PhD Married 58293.0 1 0 2014-01-19 94 173.0 43.0 ... 0 0 0 0 0 0 0 2014-01 43 1

5 rows × 29 columns

In [ ]:
#Total number of family members each family has

#Numer of parents at home
df_copy['N-Parents'] = 0

for i,col in enumerate(df_copy['Marital_Status']):
  if col == 'Together' or col =='Married':
    df_copy['N-Parents'].iloc[i]= 2

  else:
    df_copy['N-Parents'].iloc[i]= 1

df_copy['Family_Size'] = df_copy['Total_kids'] + df_copy['N-Parents']
df_copy.drop(['N-Parents'], axis = 1, inplace = True)
df_copy.head()
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Year_Month Customer_age Total_kids Family_Size
0 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635.0 79.5 ... 0 0 0 0 0 1 2012-09 67 0 1
1 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11.0 1.0 ... 0 0 0 0 0 0 2014-03 70 2 3
2 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426.0 49.0 ... 0 0 0 0 0 0 2013-08 59 0 2
3 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11.0 4.0 ... 0 0 0 0 0 0 2014-02 40 1 3
4 1981 PhD Married 58293.0 1 0 2014-01-19 94 173.0 43.0 ... 0 0 0 0 0 0 2014-01 43 1 3

5 rows × 30 columns

In [ ]:
# how long the customer has been with the company?
#already included in main data set as Time_since_enrol: was created when data type was sorted
df_copy['Dt_Customer'] = pd.to_datetime(df_copy['Dt_Customer'], dayfirst= True)
now = pd.Timestamp('now')
df_copy['Time_since_enrol'] = (now - df_copy['Dt_Customer']).dt.days/365.25
df_copy['Time_since_enrol'] = np.round(df_copy['Time_since_enrol'],2)
df_copy['Time_since_enrol'].head()
Out[ ]:
0    11.75
1    10.25
2    10.79
3    10.32
4    10.38
Name: Time_since_enrol, dtype: float64
In [ ]:
# total amount spent by the customers on various products?

df_copy['Total_spendings'] = df_copy['MntFishProducts'] + df['MntFruits'] + df_copy['MntGoldProds'] +df_copy['MntMeatProducts'] +df_copy['MntSweetProducts'] + df_copy['MntWines']
df_copy.head()
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... AcceptedCmp1 AcceptedCmp2 Complain Response Year_Month Customer_age Total_kids Family_Size Time_since_enrol Total_spendings
0 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635.0 79.5 ... 0 0 0 1 2012-09 67 0 1 11.75 1560.375
1 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11.0 1.0 ... 0 0 0 0 2014-03 70 2 3 10.25 27.000
2 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426.0 49.0 ... 0 0 0 0 2013-08 59 0 2 10.79 776.000
3 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11.0 4.0 ... 0 0 0 0 2014-02 40 1 3 10.32 53.000
4 1981 PhD Married 58293.0 1 0 2014-01-19 94 173.0 43.0 ... 0 0 0 0 2014-01 43 1 3 10.38 422.000

5 rows × 32 columns

In [ ]:
#how many offers the customers have accepted

df_copy['Total_offers'] = df_copy['AcceptedCmp1'] +df_copy['AcceptedCmp2'] + df_copy['AcceptedCmp3']  + df_copy['AcceptedCmp4'] + df_copy['AcceptedCmp5'] + df_copy['Response']
df_copy.head()
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... AcceptedCmp2 Complain Response Year_Month Customer_age Total_kids Family_Size Time_since_enrol Total_spendings Total_offers
0 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635.0 79.5 ... 0 0 1 2012-09 67 0 1 11.75 1560.375 1
1 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11.0 1.0 ... 0 0 0 2014-03 70 2 3 10.25 27.000 0
2 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426.0 49.0 ... 0 0 0 2013-08 59 0 2 10.79 776.000 0
3 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11.0 4.0 ... 0 0 0 2014-02 40 1 3 10.32 53.000 0
4 1981 PhD Married 58293.0 1 0 2014-01-19 94 173.0 43.0 ... 0 0 0 2014-01 43 1 3 10.38 422.000 0

5 rows × 33 columns

In [ ]:
#amount spent per purchase

df_copy['Total_n_purchases'] = df_copy['NumCatalogPurchases'] + df_copy['NumWebPurchases'] + df_copy['NumStorePurchases'] + df_copy['NumDealsPurchases']

df_copy['Amount_per_purchase'] = df_copy['Total_spendings'] / df_copy['Total_n_purchases']
df_copy['Amount_per_purchase'] = np.round(df_copy['Amount_per_purchase'],2)

df_copy.drop('Total_n_purchases', axis = 1, inplace = True)

df_copy.head()
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... Complain Response Year_Month Customer_age Total_kids Family_Size Time_since_enrol Total_spendings Total_offers Amount_per_purchase
0 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635.0 79.5 ... 0 1 2012-09 67 0 1 11.75 1560.375 1 62.42
1 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11.0 1.0 ... 0 0 2014-03 70 2 3 10.25 27.000 0 4.50
2 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426.0 49.0 ... 0 0 2013-08 59 0 2 10.79 776.000 0 36.95
3 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11.0 4.0 ... 0 0 2014-02 40 1 3 10.32 53.000 0 6.62
4 1981 PhD Married 58293.0 1 0 2014-01-19 94 173.0 43.0 ... 0 0 2014-01 43 1 3 10.38 422.000 0 22.21

5 rows × 34 columns

In [ ]:
#looking for missing values
df_copy.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2027 entries, 0 to 2239
Data columns (total 34 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Year_Birth           2027 non-null   int64         
 1   Education            2027 non-null   object        
 2   Marital_Status       2027 non-null   object        
 3   Income               2027 non-null   float64       
 4   Kidhome              2027 non-null   int64         
 5   Teenhome             2027 non-null   int64         
 6   Dt_Customer          2027 non-null   datetime64[ns]
 7   Recency              2027 non-null   int64         
 8   MntWines             2027 non-null   float64       
 9   MntFruits            2027 non-null   float64       
 10  MntMeatProducts      2027 non-null   float64       
 11  MntFishProducts      2027 non-null   float64       
 12  MntSweetProducts     2027 non-null   float64       
 13  MntGoldProds         2027 non-null   float64       
 14  NumDealsPurchases    2027 non-null   float64       
 15  NumWebPurchases      2027 non-null   float64       
 16  NumCatalogPurchases  2027 non-null   float64       
 17  NumStorePurchases    2027 non-null   int64         
 18  NumWebVisitsMonth    2027 non-null   float64       
 19  AcceptedCmp3         2027 non-null   int64         
 20  AcceptedCmp4         2027 non-null   int64         
 21  AcceptedCmp5         2027 non-null   int64         
 22  AcceptedCmp1         2027 non-null   int64         
 23  AcceptedCmp2         2027 non-null   int64         
 24  Complain             2027 non-null   int64         
 25  Response             2027 non-null   int64         
 26  Year_Month           2027 non-null   period[M]     
 27  Customer_age         2027 non-null   int64         
 28  Total_kids           2027 non-null   int64         
 29  Family_Size          2027 non-null   int64         
 30  Time_since_enrol     2027 non-null   float64       
 31  Total_spendings      2027 non-null   float64       
 32  Total_offers         2027 non-null   int64         
 33  Amount_per_purchase  2027 non-null   float64       
dtypes: datetime64[ns](1), float64(14), int64(16), object(2), period[M](1)
memory usage: 618.8+ KB
In [ ]:
#looking for inf values
df_copy[df_copy['Amount_per_purchase']==np.inf]
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... Complain Response Year_Month Customer_age Total_kids Family_Size Time_since_enrol Total_spendings Total_offers Amount_per_purchase
655 1975 Graduation Divorced 118043.5 0 0 2014-02-07 81 1.0 1.0 ... 0 0 2014-02 49 0 1 10.32 6.0 0 inf
981 1965 Graduation Divorced 4861.0 0 0 2014-06-22 20 2.0 1.0 ... 0 0 2014-06 59 0 1 9.95 6.0 0 inf
1524 1973 Graduation Single 3502.0 1 0 2013-04-13 56 2.0 1.0 ... 0 0 2013-04 51 1 2 11.15 5.0 0 inf
2132 1949 PhD Married 118043.5 0 0 2013-08-29 85 2.0 1.0 ... 0 0 2013-08 75 0 2 10.77 8.0 0 inf

4 rows × 34 columns

In [ ]:
#drop these rows
df_copy = df_copy[~df_copy['Amount_per_purchase'].isin([np.inf, -np.inf])]
df_copy.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2023 entries, 0 to 2239
Data columns (total 34 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Year_Birth           2023 non-null   int64         
 1   Education            2023 non-null   object        
 2   Marital_Status       2023 non-null   object        
 3   Income               2023 non-null   float64       
 4   Kidhome              2023 non-null   int64         
 5   Teenhome             2023 non-null   int64         
 6   Dt_Customer          2023 non-null   datetime64[ns]
 7   Recency              2023 non-null   int64         
 8   MntWines             2023 non-null   float64       
 9   MntFruits            2023 non-null   float64       
 10  MntMeatProducts      2023 non-null   float64       
 11  MntFishProducts      2023 non-null   float64       
 12  MntSweetProducts     2023 non-null   float64       
 13  MntGoldProds         2023 non-null   float64       
 14  NumDealsPurchases    2023 non-null   float64       
 15  NumWebPurchases      2023 non-null   float64       
 16  NumCatalogPurchases  2023 non-null   float64       
 17  NumStorePurchases    2023 non-null   int64         
 18  NumWebVisitsMonth    2023 non-null   float64       
 19  AcceptedCmp3         2023 non-null   int64         
 20  AcceptedCmp4         2023 non-null   int64         
 21  AcceptedCmp5         2023 non-null   int64         
 22  AcceptedCmp1         2023 non-null   int64         
 23  AcceptedCmp2         2023 non-null   int64         
 24  Complain             2023 non-null   int64         
 25  Response             2023 non-null   int64         
 26  Year_Month           2023 non-null   period[M]     
 27  Customer_age         2023 non-null   int64         
 28  Total_kids           2023 non-null   int64         
 29  Family_Size          2023 non-null   int64         
 30  Time_since_enrol     2023 non-null   float64       
 31  Total_spendings      2023 non-null   float64       
 32  Total_offers         2023 non-null   int64         
 33  Amount_per_purchase  2023 non-null   float64       
dtypes: datetime64[ns](1), float64(14), int64(16), object(2), period[M](1)
memory usage: 553.2+ KB
In [ ]:
df_copy
Out[ ]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... Complain Response Year_Month Customer_age Total_kids Family_Size Time_since_enrol Total_spendings Total_offers Amount_per_purchase
0 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635.0 79.5 ... 0 1 2012-09 67 0 1 11.75 1560.375 1 62.42
1 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11.0 1.0 ... 0 0 2014-03 70 2 3 10.25 27.000 0 4.50
2 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426.0 49.0 ... 0 0 2013-08 59 0 2 10.79 776.000 0 36.95
3 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11.0 4.0 ... 0 0 2014-02 40 1 3 10.32 53.000 0 6.62
4 1981 PhD Married 58293.0 1 0 2014-01-19 94 173.0 43.0 ... 0 0 2014-01 43 1 3 10.38 422.000 0 22.21
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2233 1977 Graduation Together 118043.5 1 0 2013-06-02 23 9.0 14.0 ... 0 0 2013-06 47 1 3 11.01 62.000 0 5.64
2235 1967 Graduation Married 61223.0 0 1 2013-06-13 46 709.0 43.0 ... 0 0 2013-06 57 1 3 10.98 1185.375 0 65.85
2237 1981 Graduation Divorced 56981.0 0 0 2014-01-25 91 908.0 48.0 ... 0 0 2014-01 43 0 1 10.36 1241.000 1 65.32
2238 1956 Master Together 69245.0 0 1 2014-01-24 8 428.0 30.0 ... 0 0 2014-01 68 1 3 10.36 843.000 0 36.65
2239 1954 PhD Married 52869.0 1 1 2012-10-15 40 84.0 3.0 ... 0 1 2012-10 70 2 4 11.64 172.000 1 15.64

2023 rows × 34 columns

In [ ]:
df_copy.columns
Out[ ]:
Index(['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Response', 'Year_Month', 'Customer_age',
       'Total_kids', 'Family_Size', 'Time_since_enrol', 'Total_spendings',
       'Total_offers', 'Amount_per_purchase'],
      dtype='object')
In [ ]:
#Multivariate analysis with new variables
# correlation between all variables: bivariate analysis for numerical values, removing categorical variables, to focus on significant data
fig = plt.figure(figsize=(17,5))

sns.heatmap(df_copy[['Income', 'Kidhome',
       'Teenhome', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
        'Customer_age', 'Family_Size', 'Time_since_enrol', 'Total_spendings', 'Total_offers',
       'Amount_per_purchase']].corr(), annot = True, cmap='coolwarm', vmin=-1, vmax=1)
plt.xticks(rotation = 60)
plt.show()

Time since enrollement is positively correlated with Web purchases, Deals purchases, number of visits per month

Important Insights from EDA and Data Preprocessing¶

Insights have been reported above after each graph.

All products purchases are correlated with each others and can be grouped. Main significant offers are cmp5 and response, but they can all be groupes together.

Complain and recency are not correlated with anything else and can be dropped. Family size can replace Kidhome and Teenhome and reflect marital status partially.

Income and age seem to be inversely correlated and I suspect they will be the main characteritics to identify clusters.

Among purchases, there seem to be one group behaving the same way: store, catalog and web on one side and deals on the other side. They seem to be inversely correlated.

Higher income seem to be correlated with higher age, more products purchase, less deals purchases, less kids at home and higher education.

Data Preparation for Segmentation¶

  • The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
  • Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
  • Plot the correlation plot after we've removed the irrelevant variables
  • Scale the Data

Choosing segmentation attributes: Time since enrollement, amount per purchase, Total spending, Total offers, NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth'

Choosing the profiling attributes: 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Customer_age','Family_Size'

In [ ]:
# Data selected from df_copy and that will be used for analysis: df-segm
df_segm = df_copy[['Income','Kidhome','Education', 'Marital_Status',
       'Teenhome', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth','MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds','Recency',
        'Customer_age', 'Family_Size', 'Time_since_enrol', 'Total_spendings', 'Total_offers',
       'Amount_per_purchase']]
In [ ]:
# Attribute data:
df_attr = df_segm[['NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
        'Time_since_enrol', 'Total_offers',
       'Amount_per_purchase']]
In [ ]:
# Profile data selected from df_copy and that will be used for clusters profile: df_profile
df_profile = df_segm[['Education', 'Marital_Status', 'Income', 'Customer_age','Family_Size']]
In [ ]:
df_profile.reset_index(drop=True, inplace=True)
df_profile
Out[ ]:
Education Marital_Status Income Customer_age Family_Size
0 Graduation Single 58138.0 67 1
1 Graduation Single 46344.0 70 3
2 Graduation Together 71613.0 59 2
3 Graduation Together 26646.0 40 3
4 PhD Married 58293.0 43 3
... ... ... ... ... ...
2018 Graduation Together 118043.5 47 3
2019 Graduation Married 61223.0 57 3
2020 Graduation Divorced 56981.0 43 1
2021 Master Together 69245.0 68 3
2022 PhD Married 52869.0 70 4

2023 rows × 5 columns

In [ ]:
#get dummies
df_attr = pd.get_dummies(df_attr, drop_first =True)
In [ ]:
# Scaling the data and storing the output as a new DataFrame

scaler = StandardScaler()

data_scaled = pd.DataFrame(scaler.fit_transform(df_attr), columns = df_attr.columns)

data_scaled.head()
Out[ ]:
NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Time_since_enrol Total_offers Amount_per_purchase
0 0.500769 1.475284 2.700124 -0.555750 0.726408 1.532331 0.613153 1.192302
1 -0.151133 -1.164865 -0.597955 -1.174167 -0.130707 -1.174695 -0.509389 -1.024993
2 -0.803035 1.475284 -0.231501 1.299501 -0.559265 -0.200166 -0.509389 0.217259
3 -0.151133 -0.787701 -0.964408 -0.555750 0.297851 -1.048367 -0.509389 -0.943835
4 1.804572 0.343792 0.134952 0.062667 -0.130707 -0.940086 -0.509389 -0.347018
In [ ]:
data_scaled_copy= data_scaled.copy(deep=True)

Applying T-SNE and PCA to the data to visualize the data distributed in 2 dimensions¶

Applying T-SNE¶

In [ ]:
#Let's try to visualize the data for different perplexity values

for i in range(10, 50, 5):
    tsne = TSNE(n_components = 2, random_state = 1, perplexity = i)

    data_scaled_copy_tsne = tsne.fit_transform(data_scaled_copy)

    data_scaled_tsne = pd.DataFrame(data_scaled_copy_tsne)

    data_scaled_tsne.columns = ['X1', 'X2']

    plt.figure(figsize = (5,5))

    sns.scatterplot(x = 'X1', y = 'X2', data = data_scaled_tsne)

    plt.title("perplexity = {}".format(i))

Observation and Insights: It is very difficukt to identify clearly any cluster with any perplexity

Applying PCA¶

Think about it:

  • Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?

We will apply clustering algorithms on pca data as no clear cluster can be easily identified from tSNE, and there are many fetaures on orignal data

In [ ]:
# Defining the number of principal components to generate
n = data_scaled_copy.shape[1]

# Finding principal components for the data
pca1 = PCA(n_components = n, random_state = 42)
data_scaled_copy_pca = pd.DataFrame(pca1.fit_transform(data_scaled_copy))


data_scaled_copy_pca

# The percentage of variance explained by each principal component
exp_var1 = pca1.explained_variance_ratio_
exp_var1
Out[ ]:
array([0.3895066 , 0.21586571, 0.12276702, 0.09498901, 0.061424  ,
       0.04817276, 0.03760428, 0.02967062])
In [ ]:
# Visualize the explained variance by individual components
plt.figure(figsize = (10, 8))

plt.plot(range(1, 9), pca1.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')

plt.title("Explained Variances by Components")

plt.xlabel("Number of Components")

plt.ylabel("Cumulative Explained Variance")

plt.show()

First 2 components explain 60% of variance of data, 3 components 60%. This is not very satisfactory. 70% is achieved with 3 components

In [ ]:
data_scaled_copy_pca
Out[ ]:
0 1 2 3 4 5 6 7
0 2.142691 1.861015 1.150941 -0.397170 0.187128 -1.885127 -1.108254 -0.588348
1 -1.858299 -1.334782 -0.292856 0.454520 0.412357 -0.249644 -0.248203 0.042585
2 1.158323 -0.036337 -1.050881 -0.192170 -1.505534 0.476404 0.146352 0.473095
3 -1.747297 -0.834835 -0.412753 0.481666 -0.105158 0.204336 0.051958 -0.111660
4 -0.255112 0.691678 -1.447538 1.090458 0.843309 -0.371551 0.010073 0.016424
... ... ... ... ... ... ... ... ...
2018 -1.614064 0.669353 -0.315548 0.114864 0.760880 -0.207679 -0.162363 0.372103
2019 1.015851 0.573296 -0.201157 -0.052381 -1.110435 -1.558725 0.608450 0.627389
2020 1.471442 -0.926131 -0.086432 0.264990 -0.342222 1.455138 1.081567 -1.689674
2021 1.493209 -0.572487 -1.536689 0.253703 -0.295094 0.164763 -0.310308 -0.146040
2022 -1.004285 1.083098 1.159305 -0.337881 0.453424 0.408686 -0.104356 0.321399

2023 rows × 8 columns

In [ ]:
#Display coefficients for each PCA
pca1.components_
Out[ ]:
array([[-0.07705304,  0.33832866,  0.49335718,  0.43501722, -0.35502282,
         0.06265626,  0.28515931,  0.48546104],
       [ 0.58548139,  0.4264846 , -0.01277399,  0.13064522,  0.45308201,
         0.50055509,  0.00856716, -0.04667951],
       [-0.31066639, -0.1811064 , -0.01099432, -0.37609205,  0.27173606,
         0.37301149,  0.69924042,  0.16494099],
       [ 0.32594832,  0.2583624 , -0.06733621, -0.084557  ,  0.13137816,
        -0.71833872,  0.52128916, -0.1015351 ],
       [ 0.64271357, -0.60295541,  0.24764228, -0.10927976, -0.34213284,
         0.10570897,  0.14671606,  0.01845023],
       [-0.08147448, -0.25398162, -0.35369141,  0.73436696, -0.03990672,
         0.10653671,  0.33430433, -0.37384596],
       [ 0.16006579, -0.12411817, -0.62962583,  0.08956788,  0.02362959,
        -0.07235858, -0.06735089,  0.73769388],
       [ 0.05637887,  0.40078266, -0.41100788, -0.29298026, -0.67702954,
         0.25022086,  0.14181002, -0.20084874]])
In [ ]:
#. better representation of the data PCA and variables
cols = ['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6', 'PCA7','PCA8']

pc1 = pd.DataFrame(np.round(pca1.components_.T[:, 0:10], 2),index = data_scaled_copy.columns, columns = cols )
pc1
Out[ ]:
PCA1 PCA2 PCA3 PCA4 PCA5 PCA6 PCA7 PCA8
NumDealsPurchases -0.08 0.59 -0.31 0.33 0.64 -0.08 0.16 0.06
NumWebPurchases 0.34 0.43 -0.18 0.26 -0.60 -0.25 -0.12 0.40
NumCatalogPurchases 0.49 -0.01 -0.01 -0.07 0.25 -0.35 -0.63 -0.41
NumStorePurchases 0.44 0.13 -0.38 -0.08 -0.11 0.73 0.09 -0.29
NumWebVisitsMonth -0.36 0.45 0.27 0.13 -0.34 -0.04 0.02 -0.68
Time_since_enrol 0.06 0.50 0.37 -0.72 0.11 0.11 -0.07 0.25
Total_offers 0.29 0.01 0.70 0.52 0.15 0.33 -0.07 0.14
Amount_per_purchase 0.49 -0.05 0.16 -0.10 0.02 -0.37 0.74 -0.20
In [ ]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=data_scaled_copy_pca.iloc[:,0], y=data_scaled_copy_pca.iloc[:,1], data=data_scaled_copy_pca)
plt.title('2 Component PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()

We can see one dense cluster on the left side of the graph and more spread data on the right

Observation and Insights:

K-Means¶

Think About It:

  • How do we determine the optimal K value from the elbow curve?
  • Which metric can be used to determine the final K value?
In [ ]:
# Empty dictionary to store the SSE for each value of K
sse = {}

# Iterate for a range of Ks and fit the scaled data to the algorithm.
# Use inertia attribute from the clustering object and store the inertia value for that K
k_means_df = data_scaled_copy_pca.copy()
for k in range(1, 10):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(k_means_df)

    sse[k] = kmeans.inertia_

# Elbow plot
plt.figure()

plt.plot(list(sse.keys()),list(sse.values()), 'bx-')

plt.xlabel("Number of cluster")

plt.ylabel("SSE")

plt.show()

From Elbow method, 2 or 3 clusters seem the best. Let's see the silhouette score

In [ ]:
# Range of cluster numbers to try
cluster_range = range(2, 11)

silhouette_scores = []

# Perform KMeans clustering for each cluster number and calculate silhouette score
for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(k_means_df)
    silhouette_avg = silhouette_score(k_means_df, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Create a line plot of the silhouette scores
plt.figure(figsize=(8, 6))
plt.plot(cluster_range, silhouette_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Different Cluster Numbers')
plt.show()

2 clsuters has the highest silhouette score. So we will choose 2 clusters.

Applying KMeans on the PCA data and visualize the clusters¶

In [ ]:
# visualizing the 2 clusters on PCA graph.As 2 PCAs reflext only 50% of the data, I chose to do a 3D graph
k_means_df = data_scaled_copy_pca.copy(deep=True)
kmeans = KMeans(n_clusters = 2, random_state = 1)

kmeans.fit(k_means_df)

# Adding predicted labels to the original data and the scaled data
k_means_df['KMeans_Labels'] = kmeans.predict(data_scaled_copy_pca)
labels1 = kmeans.labels_

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

ax.set_title('3D PCA Visualization')  # choosing 3D as 70% were best represented with 3 PCAs

ax.scatter(k_means_df.iloc[:,0], k_means_df.iloc[:,1], k_means_df.iloc[:,2], c=labels1, cmap='viridis')
plt.title('3 Component PCA')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel ('Principal Component 3')
plt.grid()
plt.show()
In [ ]:
# As I hesitated with 3 clusters, Let's look at what 3 clusters would look like
# visualizing the 2 clusters on PCA graph.As 2 PCAs reflext only 50% of the data, I chose to do a 3D graph
k_means_df = data_scaled_copy_pca.copy()
kmeans3 = KMeans(n_clusters = 3, random_state = 1)

kmeans3.fit(k_means_df)

# Adding predicted labels to the original data and the scaled data
k_means_df['KMeans_Labels'] = kmeans.predict(data_scaled_copy_pca)
labels1 = kmeans3.labels_

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

ax.set_title('3D PCA Visualization')

ax.scatter(k_means_df.iloc[:,0], k_means_df.iloc[:,1], k_means_df.iloc[:,2], c=labels1, cmap='viridis')
plt.title('3 Component PCA')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel ('Principal Component 3')
plt.grid()
plt.show()

This third cluster stands between the 2 clusters, so does not seem to add new information. I will continue with 2 clusters

Cluster Profiling¶

In [ ]:
# Creating a copy of the original data
df1 = df_segm.copy(deep=True)

# Adding K-Means cluster labels to the K-Means and original dataframes
#k_means_df["KM_segments"] = kmeans.labels_
df1["KM_segments"] = kmeans.labels_
In [ ]:
km_cluster_profile = df1.groupby("KM_segments").mean(numeric_only = True)
In [ ]:
km_cluster_profile['Total_count'] = (df1.groupby("KM_segments")["Total_offers"].count().values)
In [ ]:
km_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0)
Out[ ]:
  Income Kidhome Teenhome NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds Recency Customer_age Family_Size Time_since_enrol Total_spendings Total_offers Amount_per_purchase Total_count
KM_segments                                            
0 37270.609626 0.732620 0.531194 2.386809 2.755793 0.747772 3.648841 6.436720 77.013369 6.233512 32.981283 9.206328 6.562500 20.777629 48.588235 53.393939 2.912656 10.847175 152.790664 0.194296 13.192308 1122
1 70125.679245 0.085461 0.482797 2.038846 5.748058 4.977802 8.472808 3.895671 586.955605 40.503885 298.227525 60.296892 41.596004 62.447836 49.192009 57.213097 2.203108 10.967836 1100.731410 0.776915 53.792575 901
In [ ]:
df_profile
Out[ ]:
Education Marital_Status Income Customer_age Family_Size
0 Graduation Single 58138.0 67 1
1 Graduation Single 46344.0 70 3
2 Graduation Together 71613.0 59 2
3 Graduation Together 26646.0 40 3
4 PhD Married 58293.0 43 3
... ... ... ... ... ...
2018 Graduation Together 118043.5 47 3
2019 Graduation Married 61223.0 57 3
2020 Graduation Divorced 56981.0 43 1
2021 Master Together 69245.0 68 3
2022 PhD Married 52869.0 70 4

2023 rows × 5 columns

In [ ]:
df1_cat = df1[['Marital_Status', 'Education','Family_Size' ,"KM_segments"]]
df1_cat.head()
Out[ ]:
Marital_Status Education Family_Size KM_segments
0 Single Graduation 1 1
1 Single Graduation 3 0
2 Together Graduation 2 1
3 Together Graduation 3 0
4 Married PhD 3 0
In [ ]:
#fvisualisation of KM clusters by cat values
fig, ax =plt.subplots(1,2,figsize=(10,5))
order = ['Single', 'Widow', 'Divorced', 'Married', 'Together']

for i in range (2):
  plt.subplot(1,2,i+1)
  sns.countplot(x='Marital_Status', data = df1_cat[df1_cat["KM_segments"]==i], palette ='Set1', stat = 'percent', order = order)
  plt.xticks(rotation=60, fontsize=8)
  plt.title ('"KM_segments" = {}'.format(i))


plt.show()

Same profile of marital status between the 2 clusters

In [ ]:
#for i in range (2):
fig, ax =plt.subplots(1,2,figsize=(10,5))

for i in range (2):
  plt.subplot(1,2,i+1)
  sns.countplot(x='Family_Size', data = df1_cat[df1_cat["KM_segments"]==i], palette ='Set1', stat = 'percent')
  plt.xticks(rotation=60, fontsize=8)
  plt.title ('"KM_segments" = {}'.format(i))


plt.show()

Larger family size for cluster 0 (first cluster)

In [ ]:
order = ['Basic', 'Graduation', 'Master', 'PhD']

#for i in range (2):
fig, ax =plt.subplots(1,2,figsize=(10,5))

for i in range (2):
  plt.subplot(1,2,i+1)
  sns.countplot(x='Education', data = df1_cat[df1_cat["KM_segments"]==i], palette ='Set1', stat = 'percent', order = order)
  plt.xticks(rotation=60, fontsize=8)
  plt.title ('"KM_segments" = {}'.format(i))


plt.show()

Comparable education status between the 2 clusters.

Describe the characteristics of each cluster¶

Cluster 0:¶

Age: The average age is 53.

Income: The average income is 38,162.

Family size: On average, there are 2.3 members per household

Recency: The average recency is 49 days.

Purchasing Behavior:

Wines: 93.82

Fruits: 5.85

Meat Products: 33.88

Fish Products: 9.40

Sweet Products: 5.85

Gold Products: 5.85

Purchases:

Deals: 2.85

Web: 3.85

Catalog: 1.85

Store: 4.85

Web Visits:3.85

Cluster 1:¶

AgeThe average age is 57

Income: The average income is 70 126.

Family size:On average, there are On average, there are 2.9 members per household

Purchasing Behavior: Wines: 592.46

Fruits: 51.62

Meat Products: 352.96

Fish Products: 75.63

Sweet Products: 42.96

Gold Products: 44.96

Purchases:

Deals: 2.04

Web: 5.7

Catalog: 4.99

Store: 8.47

Amount per purchase: 49.62

Web Visits: 3.89

Time since enrollment: 10.99

Total spendings:1100.73

Total offers: 0.77

The first cluster contains 901 observations. It has a higher total offers, amount per purchase, total spendings, number of web, catalog and store purchases. These clients have higher income, are older and have less family members, less kids and teens at home.

The second cluster contains 1122 observations. Its attributes are higher number of deals purchases and webs visits per month, but lower web, catalog, store purchases, lower offers and lower amount per purchase. Its profile is lower income,larger family size with more kids and Teens at home,

Time since enrollment is comparable between clusters.

Think About It:

  • Are the K-Means profiles providing any deep insights into customer purchasing behavior or which channels they are using?
  • What is the next step to get more meaningful insights?

Summary of each cluster: These 2 clusters do not differentiate for the different types of channel the clients are using. They spend globally more, take on more offers, are not interested in deals.

K-Medoids¶

In [ ]:
k_med_df = data_scaled_copy_pca.copy(deep=True)
kmed = KMedoids(n_clusters = 2, random_state = 1)

kmed.fit(k_med_df)
Out[ ]:
KMedoids(n_clusters=2, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMedoids(n_clusters=2, random_state=1)
In [ ]:
# Creating a copy of the original data
df2 = df_segm.copy()

# Add K-Medoids cluster labels to K-Medoids data
k_med_df["KMed_segments"] = kmed.labels_
# Add K-Medoids cluster labels to the whole data
df2["KMed_segments"] =  kmed.labels_
Kmedlabels = kmed.labels_

Visualize the clusters using PCA¶

In [ ]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=k_med_df.iloc[:,0], y=k_med_df.iloc[:,1], data=k_med_df, hue = "KMed_segments")
plt.title('2 Component PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()
In [ ]:
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

ax.set_title('3D PCA Visualization')

ax.scatter(k_med_df.iloc[:,0], k_med_df.iloc[:,1], k_med_df.iloc[:,2], c=Kmedlabels, cmap='viridis')
plt.title('3 Component PCA')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel ('Principal Component 3')
plt.grid()
plt.show()

Cluster Profiling¶

In [ ]:
kmed_cluster_profile = df2.groupby("KMed_segments").mean(numeric_only = True)
In [ ]:
kmed_cluster_profile["count_in_each_segment"] = (
    df2.groupby("KMed_segments")["Total_offers"].count().values)

kmed_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0)
Out[ ]:
  Income Kidhome Teenhome NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds Recency Customer_age Family_Size Time_since_enrol Total_spendings Total_offers Amount_per_purchase count_in_each_segment
KMed_segments                                            
0 35765.411825 0.753313 0.481142 2.026504 2.286442 0.605505 3.281346 6.371050 50.101937 5.099388 24.873598 7.481142 5.261850 16.751274 48.604485 53.036697 2.880734 10.795515 109.576835 0.193680 11.779990 981
1 67096.921305 0.153551 0.536468 2.425144 5.785029 4.539347 8.166027 4.301344 543.287908 36.934261 269.968330 55.007678 38.079894 60.599808 49.095010 57.032630 2.329175 11.000144 1013.143234 0.698656 49.628321 1042
In [251]:
Education_order = ['Basic', 'Graduation', 'Master', 'PhD']
print('Education for each label')
plt.subplots(1,2,figsize=(10,5))
for i in range(2):
  plt.subplot(1,2,i+1)
  sns.countplot(x='Education', data = df2[df2["KMed_segments"]==i], palette ='Set1', hue = 'Education', order = Education_order , stat ='percent')
  plt.xticks(rotation=60, fontsize=8)
  plt.title('Education for label {}'.format(i))
plt.show()

Marital_order = ['Single', 'Widow', 'Divorced', 'Married', 'Together']

print('Marital Status for each label')
plt.subplots(1,2,figsize=(10,5))
for i in range(2):
  plt.subplot(1,2,i+1)
  sns.countplot(x='Marital_Status', data = df2[df2["KMed_segments"]==i], palette ='Set1', hue = 'Marital_Status',order = Marital_order, stat ='percent')
  plt.xticks(rotation=60, fontsize=8)
  plt.title('Marital Status for label {}'.format(i))
plt.show()

Family_size = ['1','2','3','4','5']
print('Family Size for each label')
fig, ax =plt.subplots(1,2,figsize=(10,5))
for i in range (2):
  plt.subplot(1,2,i+1)
  sns.countplot(x='Family_Size', data = df2[df2["KMed_segments"]==i], palette ='Set1', stat = 'percent')
  plt.xticks(rotation=60, fontsize=8)
  plt.title ('Family Size for label {}'.format(i))
plt.show()
Education for each label
Marital Status for each label
Family Size for each label

Characteristics of each cluster¶

Cluster 0:¶

Age: The average age is 53.

Income: The average income is 35 765.

Family size: On average, there are 2.88 members per household

Recency: The average recency is 48.6 days.

Purchasing Behavior:

Wines: 50.1

Fruits: 5.099

Meat Products: 24.87

Fish Products: 7.48

Sweet Products: 5.26

Gold Products: 16.75

Purchases:

Deals: 2.02

Web: 2.28

Catalog: 0.60

Store: 3.28

Web Visits:6.37

Time since enrollment: 10.79

Total Spendings: 109.57

Total offers: 0.19

Amount per purchase: 11.77

Cluster 1:¶

Age: The average age is 57.

Income: The average income is 67 096.

Family size: On average, there are 2.3 members per household

Recency: The average recency is 49.09 days.

Purchasing Behavior:

Wines: 543.28

Fruits: 36.93

Meat Products: 269.96

Fish Products: 55

Sweet Products: 38.08

Gold Products: 60.6

Purchases:

Deals: 2.42

Web: 5.78

Catalog: 4.53

Store: 8.16

Web Visits:4.3

Time since enrollment: 11

Total Spendings: 1013.14

Total offers: 0.69

Amount per purchase: 49.62

Summary for each cluster:

First cluster: buy less products, no prefer channel, lower amount spent,higher visits per months, take on less offers, and also less deals. Their profile is younger people, lower income, more kids at home, bigger family. No impact of education or marital status.

second cluster: buy more products, through all channels, higher amount spent, fewer visits per months, more offers, also more deals. Their profile is older people, higher income, less kids at home, smaller family. No impact of education or marital status.

Time since enrollment not different between clusters abit higher for second cluster

Observations and Insights: Comparison with KMeans clusters: Cluster 1 with higher income takes on less deals in KMeans but more deals on KmMedoids.

Hierarchical Clustering¶

  • Find the Cophenetic correlation for different distances with different linkage methods.
  • Create the dendrograms for different linkages
  • Explore different linkages with each distance metric
In [ ]:
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
In [ ]:
#Create new dataframe
hc_df = data_scaled_copy_pca.copy(deep=True)
In [ ]:
hc_df1 = hc_df.copy(deep=True)
In [ ]:
# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

# List of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock", 'cosine']

for metric in distance_metrics:

  for method in linkage_methods:
    Z = linkage(hc_df1, method=method, metric=metric)
    cophenetic_corr, cophenetic_dist = cophenet(Z, pdist(hc_df1, metric=metric))
    print(f'Cophenetic Correlation Coefficient for metric', metric, 'and method', method, 'is', cophenetic_corr )

    #Z = linkage(hc_df1, method=method)
    #cophenetic_corr, cophenetic_dist = cophenet(Z, pdist(hc_df1))
    #print(f'Cophenetic Correlation Coefficient for {method}: {cophenetic_corr}')
Cophenetic Correlation Coefficient for metric euclidean and method single is 0.5739142238459091
Cophenetic Correlation Coefficient for metric euclidean and method complete is 0.5363585727989066
Cophenetic Correlation Coefficient for metric euclidean and method average is 0.7447600511914452
Cophenetic Correlation Coefficient for metric euclidean and method weighted is 0.6255148063460467
Cophenetic Correlation Coefficient for metric chebyshev and method single is 0.442583656212713
Cophenetic Correlation Coefficient for metric chebyshev and method complete is 0.5767882662462652
Cophenetic Correlation Coefficient for metric chebyshev and method average is 0.6531170174345047
Cophenetic Correlation Coefficient for metric chebyshev and method weighted is 0.5586349008606936
Cophenetic Correlation Coefficient for metric mahalanobis and method single is 0.7572491961515623
Cophenetic Correlation Coefficient for metric mahalanobis and method complete is 0.6449531172870345
Cophenetic Correlation Coefficient for metric mahalanobis and method average is 0.7878164560966789
Cophenetic Correlation Coefficient for metric mahalanobis and method weighted is 0.5968961538443791
Cophenetic Correlation Coefficient for metric cityblock and method single is 0.6553089968704862
Cophenetic Correlation Coefficient for metric cityblock and method complete is 0.7213233813561795
Cophenetic Correlation Coefficient for metric cityblock and method average is 0.7573727400751508
Cophenetic Correlation Coefficient for metric cityblock and method weighted is 0.651606647519734
Cophenetic Correlation Coefficient for metric cosine and method single is 0.15814661248802556
Cophenetic Correlation Coefficient for metric cosine and method complete is 0.5643160463925817
Cophenetic Correlation Coefficient for metric cosine and method average is 0.7407623901130885
Cophenetic Correlation Coefficient for metric cosine and method weighted is 0.5862487308320113

The highest correlation is for mahalanobis metric and average method copheneitc coefficient of 0.7878. However, euclidean, cityblock and cosine also have high coefficients

In [ ]:
#dendograms to define number of clusters using selected metrics and average method
selected_metrics = ["euclidean", "mahalanobis", "cityblock", 'cosine']

for metric in selected_metrics:
  plt.figure(figsize = (30, 7))
  Z = linkage(hc_df1, metric = metric, method = 'average')
  dendrogram(Z)
  plt.title('Dendrogram Linkage with average method and distance metric: {}'.format(metric))
  plt.ylabel('Distance')
  plt.show()
In [ ]:
#as ward can be associated only with euclidean:
Z = linkage(data_scaled, method='ward', metric='euclidean')
cophenetic_corr, cophenetic_dist = cophenet(Z, pdist(data_scaled, metric='euclidean'))
print(cophenetic_corr)
0.5786595294496293
In [ ]:
#dendogram for ward method and euclidean distance
Z = linkage(hc_df1, metric = "euclidean", method = 'ward')
dendrogram(Z)
plt.title('Dendrogram')
plt.ylabel('Distance')
plt.show()

The easiest to read are: cosine and average

euclidean and ward 2, 3 clusters or 5 clusters.

In [ ]:
#Let's calculate silhouette score for clusters =2, 3 or 5
# Initializing Agglomerative Clustering with distance as mahalanobis, linkage as average

n_clusters=[2,3,5]
for i in n_clusters:
    HCmodel = AgglomerativeClustering(n_clusters = i, metric = 'mahalanobis', linkage = "average")

    # Fitting on PCA data
    preds = HCmodel.fit_predict(data_scaled_copy_pca)

    score = silhouette_score(data_scaled_copy_pca, preds)             # Calculating the silhouette score

    print('Silhouette score for metric mahalanobis linkage average', 'and' ,i, 'clusters is: ', score )
Silhouette score for metric mahalanobis linkage average and 2 clusters is:  0.4075437966858895
Silhouette score for metric mahalanobis linkage average and 3 clusters is:  0.2795961118812597
Silhouette score for metric mahalanobis linkage average and 5 clusters is:  0.18745764432083314
In [ ]:
#Let's calculate silhouette score for clusters =2, 3 or 5
# Initializing Agglomerative Clustering with distance as euclidean, linkage as ward

n_clusters=[2,3,5]
for i in n_clusters:
    HCmodel = AgglomerativeClustering(n_clusters = i, metric = 'euclidean', linkage = "ward")

    # Fitting on PCA data
    preds = HCmodel.fit_predict(data_scaled_copy_pca)

    score = silhouette_score(data_scaled_copy_pca, preds)             # Calculating the silhouette score

    print('Silhouette score for metric euclidean, linkage ward', 'and' ,i, 'clusters is: ', score )
Silhouette score for metric euclidean, linkage ward and 2 clusters is:  0.2801489803225674
Silhouette score for metric euclidean, linkage ward and 3 clusters is:  0.2775679740517781
Silhouette score for metric euclidean, linkage ward and 5 clusters is:  0.18651916068595786
In [ ]:
#Let's calculate silhouette score for clusters =2, 3 or 5
# Initializing Agglomerative Clustering with distance as cosine, linkage as average

n_clusters=[2,3,5]
for i in n_clusters:
    HCmodel = AgglomerativeClustering(n_clusters = i, metric = 'cosine', linkage = 'average')

    # Fitting on PCA data
    preds = HCmodel.fit_predict(data_scaled_copy_pca)

    score = silhouette_score(data_scaled_copy_pca, preds)             # Calculating the silhouette score

    print('Silhouette score for metric cosine, linkage average', 'and' ,i, 'clusters is: ', score )
Silhouette score for metric cosine, linkage average and 2 clusters is:  0.26120149278785415
Silhouette score for metric cosine, linkage average and 3 clusters is:  0.26765486643315906
Silhouette score for metric cosine, linkage average and 5 clusters is:  0.15657201266832385

As bth cophen coeff and silhouette score are optimal for average nethod, mahalsnobis distance and 2 clusters

In [ ]:
HCmodel = AgglomerativeClustering(n_clusters = 2, metric = 'mahalanobis', linkage = "average")
 # Fitting on PCA data
preds = HCmodel.fit_predict(data_scaled_copy_pca)

score = silhouette_score(data_scaled_copy_pca, preds)             # Calculating the silhouette score

print('Silhouette score is: ',score )

Z = linkage(hc_df1, method="average", metric='mahalanobis')
cophenetic_corr, cophenetic_dist = cophenet(Z, pdist(hc_df1, metric='mahalanobis'))
print(f'Cophenetic Correlation Coefficient for metric mahalanobis and method average', 'is', cophenetic_corr )
Silhouette score is:  0.4075437966858895
Cophenetic Correlation Coefficient for metric mahalanobis and method average is 0.7878164560966789
In [ ]:
# Clustering with 2 clusters, mahalanobis metric and average method

hierarchical = AgglomerativeClustering(n_clusters = 2, metric = 'mahalanobis', linkage = 'average')

hierarchical.fit(hc_df1)
Out[ ]:
AgglomerativeClustering(linkage='average', metric='mahalanobis')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(linkage='average', metric='mahalanobis')
In [ ]:
hc_df1['HCLabels'] = hierarchical.labels_
hc_df1.HCLabels.value_counts()
Out[ ]:
HCLabels
0    2021
1       2
Name: count, dtype: int64

this is not a good spread of the data! Let's see for other options:

euclidean and ward

cosine and average

In [ ]:
hc_df1.drop('HCLabels', axis=1, inplace = True)
In [ ]:
# Clustering with 2 clusters,  euclidean metric and ward method

hierarchical = AgglomerativeClustering(n_clusters = 2, metric = 'euclidean', linkage = 'ward')

hierarchical.fit(hc_df1)
Out[ ]:
AgglomerativeClustering(metric='euclidean')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(metric='euclidean')
In [ ]:
hc_df1['HCLabels'] = hierarchical.labels_
hc_df1.HCLabels.value_counts()
Out[ ]:
HCLabels
0    1066
1     957
Name: count, dtype: int64
In [ ]:
hc_df1.drop('HCLabels', axis=1, inplace = True)
In [ ]:
# Clustering with 3 clusters,  cosine metric and average method

hierarchical = AgglomerativeClustering(n_clusters = 3, metric = 'cosine', linkage = 'average')

hierarchical.fit(hc_df1)
Out[ ]:
AgglomerativeClustering(linkage='average', metric='cosine', n_clusters=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(linkage='average', metric='cosine', n_clusters=3)
In [ ]:
hc_df1['HCLabels'] = hierarchical.labels_
hc_df1.HCLabels.value_counts()
Out[ ]:
HCLabels
1    836
2    620
0    567
Name: count, dtype: int64

this is better. So I go for cosine, average, 3 clusters. And cophenetic coeff was better for 3 clusters

Visualize the clusters using PCA¶

In [ ]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=hc_df1.iloc[:,0], y=hc_df1.iloc[:,1], data=hc_df1, hue = 'HCLabels')
plt.title('2 Component PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()

Cluster Profiling¶

In [ ]:
# Creating a copy of the original data
df3 = df_segm.copy()

# Adding hierarchical cluster labels to the HC algorithm and original dataframes
hc_df1['HCLabels'] =  hierarchical.labels_
df3['HCLabels'] = hierarchical.labels_

Observations and Insights:

Characteristics of each cluster¶

In [ ]:
hc_cluster_profile = df3.groupby("HCLabels").mean(numeric_only = True)
In [ ]:
hc_cluster_profile["count_in_each_segment"] = (
    df3.groupby("HCLabels")["Total_offers"].count().values)

hc_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0)
#hc_cluster_profile.style.highlight_min(color = "lightblue", axis = 0)
Out[ ]:
  Income Kidhome Teenhome NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds Recency Customer_age Family_Size Time_since_enrol Total_spendings Total_offers Amount_per_purchase count_in_each_segment
HCLabels                                            
0 55185.493827 0.375661 0.809524 3.758377 6.393298 2.723104 7.425044 6.269841 418.243827 23.375661 138.026455 30.985009 22.880291 52.358025 48.049383 57.576720 2.841270 11.124744 689.648810 0.373898 32.446208 567
1 33574.723086 0.794258 0.447368 1.867225 1.909091 0.440191 3.049043 6.460526 31.247608 4.312201 18.431818 6.135167 4.477273 13.680024 48.947368 52.334928 2.887560 10.796555 78.284091 0.168660 10.028852 836
2 73616.479839 0.035484 0.319355 1.327419 4.919355 5.503226 8.014516 2.864516 567.723790 42.950000 341.996774 67.676613 45.362903 62.023387 49.474194 56.546774 1.980645 10.836935 1139.860887 0.911290 58.851226 620
In [ ]:
df3_cat = df3[['Marital_Status', 'Education','Family_Size' ,'HCLabels']]
df3_cat.head()
Out[ ]:
Marital_Status Education Family_Size HCLabels
0 Single Graduation 1 2
1 Single Graduation 3 1
2 Together Graduation 2 0
3 Together Graduation 3 1
4 Married PhD 3 0
In [ ]:
#fvisualisation of HC clusters by cat values
fig, ax =plt.subplots(1,2,figsize=(10,5))
order = ['Single', 'Widow', 'Divorced', 'Married', 'Together']

for i in range (3):
  plt.subplot(1,3,i+1)
  sns.countplot(x='Marital_Status', data = df3_cat[df3_cat['HCLabels']==i], palette ='Set1', stat = 'percent', order = order)
  plt.xticks(rotation=60, fontsize=8)
  plt.title ('HCLabels= {}'.format(i))


plt.show()

fig, ax =plt.subplots(1,2,figsize=(10,5))

for i in range (3):
  plt.subplot(1,3,i+1)
  sns.countplot(x='Family_Size', data = df3_cat[df3_cat['HCLabels']==i], palette ='Set1', stat = 'percent')
  plt.xticks(rotation=60, fontsize=8)
  plt.title ('HCLabels = {}'.format(i))


plt.show()

order = ['Basic', 'Graduation', 'Master', 'PhD']

#for i in range (2):
fig, ax =plt.subplots(1,2,figsize=(10,5))

for i in range (3):
  plt.subplot(1,3,i+1)
  sns.countplot(x='Education', data = df3_cat[df3_cat['HCLabels']==i], palette ='Set1', stat = 'percent', order = order)
  plt.xticks(rotation=60, fontsize=8)
  plt.title ('HCLabels = {}'.format(i))


plt.show()

Summary of each cluster:

First cluster:

Higher for: deals pruchase, Web purchases, time since enrollment

Lower for: recency

Intermediate for the following: products purchases, total offers no prefer channel.

Their profile is intermediate between clusters 1 and 2.

Second cluster:

Higher for: web visits per month

Lower for: number of purchases, total amount spent, all channels of purchases, time since enrollment

Their profile is younger people, lower income, higher family size.

Third cluster

Higher for: all types of products purchases, catalog and store purchases, recency, total spending.

Lower for: web visits per month, deals per month

Their profile is older people, higher income, less kids at home, smaller family. No impact of education or marital status.

Time since enrollment not different between clusters a bit higher for first cluster, lower for second cluster

DBSCAN¶

DBSCAN is a very powerful algorithm for finding high-density clusters, but the problem is determining the best set of hyperparameters to use with it. It includes two hyperparameters, eps, and min samples.

Since it is an unsupervised algorithm, you have no control over it, unlike a supervised learning algorithm, which allows you to test your algorithm on a validation set. The approach we can follow is basically trying out a bunch of different combinations of values and finding the silhouette score for each of them.

In [ ]:
dbscan_df = data_scaled_copy_pca.copy(deep=True)
dbscan_df1 = dbscan_df.copy(deep=True)
In [ ]:
from sklearn.metrics import make_scorer
# Define the parameter grid
param_grid = {
    'eps': np.arange(0.1, 4, 0.1),
    'min_samples': np.arange(2, 11, 1)
}

# Create a custom scorer for silhouette score
silhouette_scorer = make_scorer(silhouette_score)

# Perform grid search
best_score = -1
best_params = None
for eps in param_grid['eps']:
    for min_samples in param_grid['min_samples']:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        clusters = dbscan.fit_predict(dbscan_df)
        if len(set(clusters)) > 1:  # Silhouette score is only valid if there is more than one cluster
            score = silhouette_score(dbscan_df, clusters)
            if score > best_score:
                best_score = score
                best_params = {'eps': eps, 'min_samples': min_samples}

print('Best Silhouette Score: ', best_score)
print('Best Parameters: ', best_params)
Best Silhouette Score:  0.42973238124276447
Best Parameters:  {'eps': 3.5000000000000004, 'min_samples': 4}

Apply DBSCAN for the best hyperparameter and visualize the clusters from PCA¶

In [ ]:
df5 = df_segm.copy()
dbscan = DBSCAN(eps=3.5, min_samples=4)
dbscan_df["db_segments"] = dbscan.fit_predict(dbscan_df)
# Add DBSCAN cluster labels to whole data
df5["db_segments"] =  dbscan.fit_predict(dbscan_df1)
In [ ]:
db_cluster_profile = df5.groupby("db_segments").mean(numeric_only = True)
In [ ]:
db_cluster_profile['Total_elements_per_cluster'] = df5.groupby("db_segments")['Total_offers'].count().values
In [ ]:
db_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0)
Out[ ]:
  Income Kidhome Teenhome NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds Recency Customer_age Family_Size Time_since_enrol Total_spendings Total_offers Amount_per_purchase Total_elements_per_cluster
db_segments                                            
-1 6835.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 13.000000 107.000000 2.000000 12.000000 2.000000 2.000000 12.000000 76.000000 67.000000 3.000000 11.490000 137.000000 0.000000 137.000000 1
0 51925.828882 0.444609 0.509397 2.232938 4.090504 2.633037 5.799703 5.301187 304.227992 21.506429 151.184965 31.975767 22.175631 39.350148 48.843719 55.089021 2.596439 10.900623 575.199369 0.454006 31.222493 2022
In [ ]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=dbscan_df1.iloc[:,0], y=dbscan_df1.iloc[:,1], data=dbscan_df, hue = "db_segments")
plt.title('2 Component PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()

DBSCAN shows 2 clusters but the last one contains only 1 observations. Not a good technique

Gaussian Mixture Model¶

In [ ]:
gmm_df = data_scaled_copy_pca.copy(deep=True)
gmm_df1 = gmm_df.copy(deep=True)
df6 = df_segm.copy(deep=True)
In [ ]:
gmm_df
Out[ ]:
0 1 2 3 4 5 6 7
0 2.142691 1.861015 1.150941 -0.397170 0.187128 -1.885127 -1.108254 -0.588348
1 -1.858299 -1.334782 -0.292856 0.454520 0.412357 -0.249644 -0.248203 0.042585
2 1.158323 -0.036337 -1.050881 -0.192170 -1.505534 0.476404 0.146352 0.473095
3 -1.747297 -0.834835 -0.412753 0.481666 -0.105158 0.204336 0.051958 -0.111660
4 -0.255112 0.691678 -1.447538 1.090458 0.843309 -0.371551 0.010073 0.016424
... ... ... ... ... ... ... ... ...
2018 -1.614064 0.669353 -0.315548 0.114864 0.760880 -0.207679 -0.162363 0.372103
2019 1.015851 0.573296 -0.201157 -0.052381 -1.110435 -1.558725 0.608450 0.627389
2020 1.471442 -0.926131 -0.086432 0.264990 -0.342222 1.455138 1.081567 -1.689674
2021 1.493209 -0.572487 -1.536689 0.253703 -0.295094 0.164763 -0.310308 -0.146040
2022 -1.004285 1.083098 1.159305 -0.337881 0.453424 0.408686 -0.104356 0.321399

2023 rows × 8 columns

In [ ]:
#Defining best number of clusters
n_clusters = range(2, 11)
sil_scores = []

for n in n_clusters:
    gmm = GaussianMixture(n, covariance_type='full', random_state=42).fit(gmm_df1)
    labels = gmm.predict(gmm_df1)
    sil_scores.append(silhouette_score(gmm_df1, labels))

plt.plot(n_clusters, sil_scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for GMM')
plt.show()

Silhouette score is the highest for 2 clusters. Second are 3 and 9.

In [ ]:
# Let's apply Gaussian Mixture
gmm = GaussianMixture(n_components = 2, random_state = 1)  # Initializing the Gaussian Mixture algorithm with n_components = 4

gmm.fit(gmm_df)

gmm_df["GMM_segments"] = gmm.predict(gmm_df)
df6["GMM_segments"] = gmm.predict(gmm_df1)
In [ ]:
gmm_df
Out[ ]:
0 1 2 3 4 5 6 7 GMM_segments
0 2.142691 1.861015 1.150941 -0.397170 0.187128 -1.885127 -1.108254 -0.588348 1
1 -1.858299 -1.334782 -0.292856 0.454520 0.412357 -0.249644 -0.248203 0.042585 0
2 1.158323 -0.036337 -1.050881 -0.192170 -1.505534 0.476404 0.146352 0.473095 1
3 -1.747297 -0.834835 -0.412753 0.481666 -0.105158 0.204336 0.051958 -0.111660 0
4 -0.255112 0.691678 -1.447538 1.090458 0.843309 -0.371551 0.010073 0.016424 0
... ... ... ... ... ... ... ... ... ...
2018 -1.614064 0.669353 -0.315548 0.114864 0.760880 -0.207679 -0.162363 0.372103 0
2019 1.015851 0.573296 -0.201157 -0.052381 -1.110435 -1.558725 0.608450 0.627389 1
2020 1.471442 -0.926131 -0.086432 0.264990 -0.342222 1.455138 1.081567 -1.689674 1
2021 1.493209 -0.572487 -1.536689 0.253703 -0.295094 0.164763 -0.310308 -0.146040 1
2022 -1.004285 1.083098 1.159305 -0.337881 0.453424 0.408686 -0.104356 0.321399 0

2023 rows × 9 columns

Visualize the clusters using PCA¶

In [ ]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=gmm_df.iloc[:,0], y=gmm_df.iloc[:,1], data=gmm_df, hue = "GMM_segments")
plt.title('2 Component PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()

Observations and Insights:

2 clsuters. Some orange points are spread on the left on the graph, within the blue cluster

Cluster Profiling¶

In [ ]:
gmm_cluster_profile = df6.groupby("GMM_segments").mean(numeric_only = True)
gmm_cluster_profile
Out[ ]:
Income Kidhome Teenhome NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth MntWines MntFruits ... MntFishProducts MntSweetProducts MntGoldProds Recency Customer_age Family_Size Time_since_enrol Total_spendings Total_offers Amount_per_purchase
GMM_segments
0 36360.455217 0.768701 0.511811 2.285433 2.488189 0.616142 3.474409 6.400591 58.196850 5.604823 ... 8.060531 5.607160 17.617618 49.360236 53.085630 2.933071 10.838750 123.634719 0.175197 11.615512
1 67585.539722 0.117180 0.507448 2.177756 5.703078 4.665343 8.141013 4.199603 552.262165 37.530785 ... 56.074975 38.872145 61.249752 48.349553 57.122145 2.257200 10.963635 1030.364697 0.734856 51.109752

2 rows × 21 columns

In [ ]:
gmm_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0)
Out[ ]:
  Income Kidhome Teenhome NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds Recency Customer_age Family_Size Time_since_enrol Total_spendings Total_offers Amount_per_purchase
GMM_segments                                          
0 36360.455217 0.768701 0.511811 2.285433 2.488189 0.616142 3.474409 6.400591 58.196850 5.604823 28.540354 8.060531 5.607160 17.617618 49.360236 53.085630 2.933071 10.838750 123.634719 0.175197 11.615512
1 67585.539722 0.117180 0.507448 2.177756 5.703078 4.665343 8.141013 4.199603 552.262165 37.530785 274.787488 56.074975 38.872145 61.249752 48.349553 57.122145 2.257200 10.963635 1030.364697 0.734856 51.109752
In [ ]:
df6_cat = df6[['Marital_Status', 'Education','Family_Size' ,"GMM_segments"]]
df6_cat.head()
Out[ ]:
Marital_Status Education Family_Size GMM_segments
0 Single Graduation 1 1
1 Single Graduation 3 0
2 Together Graduation 2 1
3 Together Graduation 3 0
4 Married PhD 3 0
In [ ]:
#fvisualisation of GMM clusters by cat values
fig, ax =plt.subplots(1,2,figsize=(10,5))
order = ['Single', 'Widow', 'Divorced', 'Married', 'Together']

for i in range (2):
  plt.subplot(1,2,i+1)
  sns.countplot(x='Marital_Status', data = df6_cat[df6_cat["GMM_segments"]==i], palette ='Set1', stat = 'percent', order = order)
  plt.xticks(rotation=60, fontsize=8)
  plt.title ('GMM_segments= {}'.format(i))


plt.show()

fig, ax =plt.subplots(1,2,figsize=(10,5))

for i in range (2):
  plt.subplot(1,2,i+1)
  sns.countplot(x='Family_Size', data = df6_cat[df6_cat["GMM_segments"]==i], palette ='Set1', stat = 'percent')
  plt.xticks(rotation=60, fontsize=8)
  plt.title ('GMM_segments = {}'.format(i))


plt.show()

order = ['Basic', 'Graduation', 'Master', 'PhD']

#for i in range (2):
fig, ax =plt.subplots(1,2,figsize=(10,5))

for i in range (2):
  plt.subplot(1,2,i+1)
  sns.countplot(x='Education', data = df6_cat[df6_cat["GMM_segments"]==i], palette ='Set1', stat = 'percent', order = order)
  plt.xticks(rotation=60, fontsize=8)
  plt.title ('GMM_segments = {}'.format(i))


plt.show()

Observations and Insights:

Characteristics of each cluster¶

Summary of each cluster:

First cluster: buy less products, no prefered channel, lower amount spent, higher visits per months and take more deals, take on less offers. Recency is higher. Their profile is younger people, lower income, more kids at home, bigger family. No impact of education or marital status.

Second cluster: buy more products, through all channels, higher amount spent, fewer visits per months, take on more offers, and less deals. Their profile is older people, higher income, less kids at home, smaller family. No impact of education or marital status.

Time since enrollment not different between clusters a bit higher for second cluster

This profile is comparable to the KMeans clusters profiles

Conclusion and Recommendations¶

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

  • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
In [ ]:
kmeans = KMeans(n_clusters = 2, random_state = 1, n_init = 'auto')        # Initializing K-Means with number of clusters as 4 and random_state=1

preds = kmeans.fit_predict((data_scaled_copy_pca))                   # Fitting and predicting K-Means on data_pca

score = silhouette_score(data_scaled_copy_pca, preds)                # Calculating the silhouette score

print(score)
0.3073648203227294
In [ ]:
kmedoids = KMedoids(n_clusters = 2, random_state = 1)   # Initializing K-Medoids with number of clusters as 4 and random_state=1

preds = kmedoids.fit_predict((data_scaled_copy_pca))                # Fitting and predicting K-Medoids on data_pca

score = silhouette_score(data_scaled_copy_pca, preds)               # Calculating the silhouette score

print(score)
0.289221596360021
In [ ]:
# Initializing Agglomerative Clustering with distance as Euclidean, linkage as ward with clusters = 4
HCmodel = AgglomerativeClustering(n_clusters = 3, metric = "cosine", linkage = "average")

# Fitting on PCA data
preds = HCmodel.fit_predict(data_scaled_copy_pca)

score = silhouette_score(data_scaled_copy_pca, preds)             # Calculating the silhouette score

print(score)
0.2657754498010657
In [ ]:
# Initializing Gaussian Mixture algorithm with number of clusters as 2 and random_state = 1
gmm = GaussianMixture(n_components=2, random_state=1)

# Fitting and predicting Gaussian Mixture algorithm on data_pca
preds = gmm.fit_predict((data_scaled_copy_pca))

# Calculating the silhouette score
score = silhouette_score(data_scaled_copy_pca, preds)

# Printing the score
print(score)
0.288121070248977

2. Refined insights:

  • What are the most meaningful insights from the data relevant to the problem?

3. Proposal for the final solution design:

  • What model do you propose to be adopted? Why is this the best solution to adopt?

Based on the silhouette score, we can see that K-Means algorithm with 2 clusters is giving the best score on the data. We would proceed K-Means with 2 clusters as the best algorithm.

We have identified 2 clusters:

People who spend lots of money to buy al kinds of products, especially expensive products such as Wine or meat. They reach the store though all channels. They visit the website less often, and everytime they buy a produt they spend a higher amount of money per purchase than the other cluster. These people are older, higher income, smaller family size. Their level of education or marital status does not seem to impact their purchase behavior. They started buying products a longer time ago.

Second cluster refer to people who buy less products, and spend less money each time they buy. They visit the website more often maybe to look for deals, as they respond more to deals. They do not seem to have a prefered channel. These people are younger, they have a bigger family and a lower income.